Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning

Extracting meaningful entities belonging to predefined categories from Visually-rich Form-like Documents (VFDs) is a challenging task. Visual and layout features such as font, background, color, and bounding box location and size provide important cues for identifying entities of the same type. However, existing models commonly train a visual encoder with weak cross-modal supervision signals, resulting in a limited capacity to capture these non-textual features and suboptimal performance. In this paper, we propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach that addresses the above limitation by enhancing the model's ability to capture fine-grained visual and layout features through the incorporation of color priors. Experimental results on benchmark datasets show that our approach substantially outperforms strong LayoutLM-series baselines, demonstrating the effectiveness of our approach. Additionally, we investigate the effects of different color schemes on our approach, providing insights for optimizing model performance. We believe our work will inspire future research on multimodal information extraction.


Introduction
Information extraction (IE) for visually-rich form-like documents (VFDs) aims to handle various types of business documents, such as invoices, receipts, and forms, that may be scanned or digitally generated. This task has attracted significant attention from the research and industrial communities (Xu et al., 2020; Garncarek et al., 2021; Gu et al., 2021; Li et al., 2021a,b). As shown in Figure 1, the goal of IE for VFDs is to identify and extract meaningful semantic entities, such as company/person names, dates/times, and contact information, from serialized OCR (Optical Character Recognition) output in the documents. Since a single modality may not capture all the semantic information present in the document, it is necessary to leverage multimodal information, including text, spatial, and visual data. Therefore, large-scale pretrained Multimodal Models (LMMs) (Gu et al., 2021; Li et al., 2021a; Appalaraju et al., 2021; Xu et al., 2021; Huang et al., 2022; Wang et al., 2022a; Lee et al., 2022), which are models that can process multiple modalities of data, have emerged as the dominant approach to IE for VFDs in recent years. State-of-the-art LMMs integrate advanced computer vision models (Ren et al., 2015; He et al., 2016) within BERT-like architectures (Devlin et al., 2019) to leverage spatial and visual information along with text and learn multimodal fused representations for form-like documents. However, these representations are biased toward the textual and spatial modalities (Cooney et al., 2023) and have limited performance, especially when the data contains richer visual information. This is because the visual encoder in these models usually plays a secondary role compared to the advanced text encoders.
There are two problems with the visual encoder in previous LMMs. First, these models only impose coarse-grained cross-modal constraints during pretraining (e.g., text-image, word-patch, and layout-text alignment (Xu et al., 2021; Huang et al., 2022; Wang et al., 2022a)) to enhance feature extraction from the visual channel, but this does not capture sufficient fine-grained visual features, leading to limited performance and underutilization of prior visual knowledge. Second, the visual encoders in previous LMMs have weaker representational capabilities than those in the latest Optical Character Recognition (OCR) engines because they do not consider intermediate tasks such as text segment detection and bounding box regression, which are important for accurately localizing and extracting fine-grained visual features.
To address the issues mentioned above, inspired by recent works on consistency learning (Zhang et al., 2018; Miyato et al., 2019; Xie et al., 2020; Lowell et al., 2021a; Liang et al., 2021; Chen et al., 2021b), we propose a novel vision-enhanced training approach, called Visually-Asymmetric coNsistenCy Learning (VANCL). By incorporating color priors with category-wise colors as additional cues to capture visual and layout features, VANCL can enhance the learning of unbiased multimodal representations in LMMs. Our approach aims to jointly consider the training objectives of the text segment (or bounding box) detection task and the entity type prediction objective, thereby bridging the preprocessing OCR stage with the downstream information extraction stage.
VANCL involves a standard learning flow and an extra vision-enhanced flow. The inputs are asymmetric in the visual modality: the original document images are input to the standard flow, while the images input to the vision-enhanced flow are synthetic painted images. The vision-enhanced flow can also be detached when deploying the model. During training, we encourage the inner visual encoder to be as strong as the outer visual encoder via consistency learning. As a result, VANCL outperforms existing methods while requiring (1) extremely little manual effort to create synthetic painted images, (2) no training from scratch, and (3) no increase in the deployment model size.
Our contributions can be summarized as follows:
• We propose a novel consistency learning approach using visually-asymmetric inputs, called VANCL, which effectively incorporates prior visual knowledge into multimodal representation learning.
• We demonstrate the effectiveness of VANCL by applying it to the task of form document information extraction using different LMM backbones. Experimental results show that the improvements from VANCL are substantial and independent of the backbone used.
• We investigate how different color schemes affect performance, and the findings are consistent with cognitive psychology theory.

Problem Formulation
We treat the Semantic Entity Recognition (SER) task as a sequential tagging problem. Given a document D consisting of a scanned image I and a list of text segments within OCR bounding boxes B = {b_1, ..., b_N}, we formulate the problem as finding an extraction function F_IE(D: ⟨B, I⟩) → E that predicts the corresponding entity types for each token in D. The predicted sequence of labels E is obtained using the "BIO" scheme, {Begin, Inside, Other}, and a pre-defined label set. We train our sequential tagging model based on pretrained LMMs and perform Viterbi decoding during testing to predict the token labels. Each bounding box b_i contains a set of M tokens transcribed by an OCR engine and coordinates defined as (x_i^1, x_i^2) for the left and right horizontal coordinates and (y_i^1, y_i^2) for the top and bottom vertical coordinates.
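As a concrete illustration of this formulation (a toy sketch, not the paper's code; the box contents, coordinates, and entity types below are hypothetical), serialized OCR output can be labeled under the BIO scheme as follows:

```python
# Toy sketch of the SER input/output: each OCR bounding box b_i carries its
# transcribed tokens and (x1, x2) / (y1, y2) coordinates; the tokens of each
# segment are labeled with the BIO scheme over a pre-defined label set.

def bio_tags(tokens, entity_type):
    """Assign BIO labels to the tokens of one text segment."""
    if entity_type == "Other":
        return ["O"] * len(tokens)
    return [f"B-{entity_type}"] + [f"I-{entity_type}"] * (len(tokens) - 1)

# Hypothetical document: two boxes from a scanned form.
boxes = [
    {"tokens": ["Date:"], "x": (10, 60), "y": (5, 15), "type": "Other"},
    {"tokens": ["12", "Jan", "1999"], "x": (65, 140), "y": (5, 15), "type": "Date"},
]

# Serialize box tokens into one sequence of labels, as the tagger predicts them.
labels = [tag for b in boxes for tag in bio_tags(b["tokens"], b["type"])]
# labels == ['O', 'B-Date', 'I-Date', 'I-Date']
```

In the full model, a pretrained LMM scores each token's label distribution and Viterbi decoding selects the best label sequence at test time.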

Approach
Our model architecture for visually-asymmetric consistency learning is illustrated in Figure 2. Inspired by mutual learning (Zhang et al., 2018), we start with a standard learning flow and an extra vision-enhanced flow, which are learned simultaneously to transfer knowledge from the vision-enhanced flow to the standard learning flow. It is worth noting that the input images for the vision-enhanced flow are colorful prompt paints, while the input images for the standard flow are the original images. Therefore, the information contained in the visual inputs to the vision-enhanced flow and the standard flow is asymmetric.

Semantic Entity Recognizer
The two flows produce the predictions

P_Y = softmax(f(X; Θ)),   P_Ỹ = softmax(f(X̃; Θ, Θ_v)),

where P_Y and P_Ỹ are the predicted probability distributions output by the standard learning flow and the extra vision-enhanced flow, i.e., the latent outputs after softmax normalization (soft labels). Note that the inputs to the two networks are different, namely X and X̃: the former is the original document image, and the latter is the synthetic document image (with additional color patches).
The training objective contains two parts: a supervision loss L_sup and a consistency loss L_cons. The supervision losses are formulated using the standard cross-entropy loss on the annotated images as follows:

L_sup = L_ce(P(y|⟨B_k, I_k⟩; Θ), y*) + L_ce(P(ỹ|⟨B_k, Ĩ_k⟩; Θ, Θ_v), y*),

where L_ce is the cross-entropy loss function and y* is the ground truth. P(y|⟨B_k, I_k⟩; Θ) and P(ỹ|⟨B_k, Ĩ_k⟩; Θ, Θ_v) refer to the corresponding predicted probability distributions of the standard and vision-enhanced models, respectively. B_k and I_k denote the bounding box position information and the original image of the k-th document, and Ĩ_k refers to the synthetic image with color patches.
The consistency loss defines the proximity between the two predictions. During training, inconsistency arises from the asymmetric information in the inputs to the standard learning flow and the vision-enhanced flow. Concretely, it is necessary to penalize the gap between the two soft label signals (i.e., the prediction distributions) generated by the standard and vision-enhanced flows. The consistency loss is computed as

L_cons = Q(P_Y, P_Ỹ),

where Q is a distance function that measures the divergence between the two distributions.
The final training objective for visually-asymmetric consistency learning is written as

L = L_sup + λ L_cons,

where λ is a hyperparameter controlling the trade-off weight. The above loss function takes into account the consistency between hard and soft labels, which also reduces the overconfidence of the model.
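The combined objective can be sketched numerically as follows. This is a minimal numpy illustration with toy single-token logits (the real model predicts per-token distributions, and Q here is instantiated as KL divergence, one of the choices discussed later):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y):
    # mean negative log-likelihood of the gold labels y*
    return float(-np.log(p[np.arange(len(y)), y]).mean())

def kl(p, q):
    # one possible choice of the distance function Q
    return float((p * np.log(p / q)).sum(axis=-1).mean())

logits_std = np.array([[2.0, 0.5, 0.1]])   # standard flow (original image X)
logits_vis = np.array([[1.8, 0.7, 0.2]])   # vision-enhanced flow (painted image)
y_star = np.array([0])                     # ground-truth label y*

p_y, p_y_tilde = softmax(logits_std), softmax(logits_vis)
lam = 0.5                                  # trade-off hyperparameter lambda
loss_sup = cross_entropy(p_y, y_star) + cross_entropy(p_y_tilde, y_star)
loss_cons = kl(p_y, p_y_tilde)
loss = loss_sup + lam * loss_cons
```

Because the two flows see asymmetric visual inputs, loss_cons is strictly positive during training and pushes the standard flow toward the vision-enhanced flow's predictions.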

Painting with colors
Visual-text alignment is essential for learning multimodal representations, but fine-grained alignment at the bounding box level has not been adequately captured by previous models. Therefore, it is imperative to explore methods for bridging the gap between text segment (or bounding box) detection and entity classification tasks.
One natural solution is to integrate label information using colors, which could effectively enhance visual-text alignment by representing entity-type information with color patches. However, the manual creation of these visual prompts would be extremely time-consuming and laborious. To tackle this problem, we adopt a simple and effective process that uses OCR bounding box coordinates to automatically paint the bounding boxes with colors in copies of the original images.
Let D_k denote the k-th document, consisting of ⟨B_k, I_k⟩. First, we make an image copy I'_k for each training instance of VFD. Then, we paint the bounding boxes in the image copy with the colors corresponding to entity types, according to the coordinates. Hence, we obtain a new image Ĩ_k with color patches after painting. This process can be represented as

Ĩ_k = Paint(ROIAlign(I'_k, B_k)),

where ROIAlign obtains fine-grained image areas corresponding to bounding boxes within the region of interest (here, B_k), and Paint fills them with the entity-type colors. Ĩ_k is the newly generated synthetic image that preserves the original visual information as well as the bounding box priors with label information. We use these prompt paints to train the extra vision-enhanced flow.
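The painting step can be sketched with plain array operations. This is illustrative only: the page, boxes, and color assignments below are stand-ins, and the color scheme is not the paper's exact palette.

```python
import numpy as np

# One illustrative color per entity type (not the paper's exact palette).
COLORS = {"Question": (255, 0, 0), "Answer": (0, 0, 255),
          "Header": (255, 255, 0), "Other": (0, 255, 0)}

def paint(image, boxes):
    """Fill each OCR bounding box in a copy of the page with its type color."""
    copy = image.copy()          # the original image is kept for the standard flow
    for (x1, x2, y1, y2), etype in boxes:
        copy[y1:y2, x1:x2] = COLORS[etype]
    return copy

page = np.full((100, 200, 3), 255, dtype=np.uint8)   # blank white page (H, W, RGB)
painted = paint(page, [((10, 90, 10, 30), "Question"),
                       ((100, 190, 10, 30), "Answer")])
```

Because the coordinates come directly from the OCR output, generating these prompt paints requires no manual annotation effort.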

Dual-flow training
Avoiding label leakage becomes a major concern in this task when directly training backbone LMMs with prompt paints using only the standard flow.
Fortunately, the dual-flow architecture used in our model not only allows for detaching the vision-enhanced flow and discarding the prompt paints at test time, but also enables the use of arbitrary backbone LMMs and any available visual encoders in the outer visual channel. This strategy avoids label leakage and enhances the visual feature-capturing ability of the original LMM through a consistency learning-based strategy similar to adversarial learning.

Backbone networks
We conduct our experiments using LayoutLM-series backbones, a family of well-pretrained transformer-based LMMs specifically designed for visually-rich document understanding tasks.

LAYOUTLM The vanilla LayoutLM model LAYOUTLM-BASE-UNCASED (Xu et al., 2020), pretrained using only text and layout information.

LAYOUTLM w/ img Since the vanilla LayoutLM model does not utilize visual information, we integrate LayoutLM with ResNet-101 (He et al., 2016) to enable feature extraction from the visual channel.

LAYOUTLMV2/LAYOUTLMV3 We also make comparisons by substituting the backbone with LAYOUTLMV2-BASE-UNCASED (Xu et al., 2021) and LAYOUTLMV3-BASE (Huang et al., 2022).

VANCL (this work) We initialize VANCL from existing pretrained LayoutLM backbones and share the parameter weights of the LayoutLMs in the standard learning and vision-enhanced flows, including the text encoder (e.g., BERT), the inner visual encoder (e.g., ResNet, ResNeXt, and Linear), the position encoders, the Transformer layers, and the classifier. For the outer visual encoder, we use ResNet-101 as the default and also test non-pretrained CNN and ResNet variants.

Experimental setups
We train all models with the default parameters using the Adam optimizer with β = (0.9, 0.99) and no warmup. All models use the same learning rate of 5e-5. For the FUNSD and SROIE datasets, we set the dropout rate to 0.3 to prevent overfitting, while we reduce it to 0.1 for the SEABILL dataset. We set the training batch size to 8 and train all models on an NVIDIA RTX 3090 GPU. We train our models for 20 iterations to reach convergence and achieve more stable performance.
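The hyperparameters above can be gathered into one explicit configuration sketch (the values are those stated in the text; the dictionary layout itself is ours, for illustration):

```python
# Training configuration from the experimental setup (illustrative layout).
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "betas": (0.9, 0.99),           # Adam's beta coefficients
    "warmup": None,                 # no warmup is used
    "learning_rate": 5e-5,          # shared by all models
    "batch_size": 8,
    "iterations": 20,               # training rounds until convergence
    "dropout": {"FUNSD": 0.3, "SROIE": 0.3, "SEABILL": 0.1},
}
```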
Result Analysis

Overall evaluation
Table 2 shows the precision, recall, and F1 scores on the form document information extraction task. Our models outperform the baseline models in terms of both LayoutLM (a model that only considers text and spatial features) and LayoutLM w/ img, LayoutLMv2, and LayoutLMv3 (models that also consider visual features). It should be noted that LayoutLM w/ img incorporates visual features after the spatial-aware multimodal transformer layers, while LayoutLMv2 and LayoutLMv3 incorporate visual features before these layers. This suggests that VANCL can be easily applied to most existing LMMs for visually-rich document analysis and information extraction tasks with little or no modification to the network architecture. Please refer to Appendix A.3 for a detailed case study.

Impact of consistency loss
For a comprehensive evaluation of each module, we conduct an additional ablation experiment examining the impact of the consistency loss.
The experimental results in Table 3 reveal that removing the consistency loss leads to a decline in the model's performance to varying extents. This finding demonstrates the significance of the consistency loss in the model. Simultaneously, it indicates that without the driving force of consistency, the addition of color information may actually increase the learning noise.

Different divergence metrics
There are many ways to measure the gap between two distributions, and different measures lead to different types of consistency loss, which can also affect the final performance of the model on form document extraction tasks. In this section, we verify two types of consistency loss, Jensen-Shannon (JS) divergence (Lin, 1991) and Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), on the form document extraction task. While the VANCL model outperforms the baseline in most cases, regardless of the backbone networks and datasets, it is still worth noting that the best choice of consistency loss varies depending on the characteristics of the dataset, such as key type and layout style, and on whether overfitting occurs due to the complexity of the model and the data size. As shown in Table 4, different consistency losses are needed for different datasets and backbone networks to achieve optimal results. For example, when using Jensen-Shannon divergence for LayoutLMv2, VANCL achieves optimal results on the SEABILL dataset, but second best on the FUNSD and SROIE datasets.
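The two divergences can be contrasted directly (numpy sketch with toy distributions): JS is symmetric and bounded above by log 2, while KL is asymmetric and unbounded, which is one reason the best-performing choice can differ across datasets and backbones.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return float((p * np.log(p / q)).sum())

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL via the mixture m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])   # standard-flow soft label (toy values)
q = np.array([0.5, 0.3, 0.2])   # vision-enhanced soft label (toy values)

# KL is asymmetric: kl(p, q) != kl(q, p) in general.
# JS is symmetric: js(p, q) == js(q, p), and 0 <= js <= log 2.
```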

Comparison with inner/outer consistency
To demonstrate the effectiveness of our proposed method, we compare our approach against R-Drop (Liang et al., 2021) and mutual learning (Zhang et al., 2018). R-Drop is a powerful and widely used regularized dropout approach that enforces inner consistency across multiple predictions of the same model.

Table 5: Comparison of F1 scores with inner and outer consistencies using R-Drop (Liang et al., 2021) and mutual learning (Zhang et al., 2018).

Visualization analysis
To more intuitively demonstrate the effect of VANCL on the obtained visual encodings in the visual channel, we visualize the encoding distributions before and after VANCL training via t-SNE.
Figure 4 shows the integration of visual information in both flows. This demonstrates that VANCL indirectly transfers the label prior distribution to the standard flow by aligning the information in the standard and vision-enhanced flows, which improves the subsequent inference process. More detailed results of the overall distribution visualization can be found in Appendix A.4.

Changing the color scheme
To verify the effect of different background colors on the model, we conduct experiments on the FUNSD dataset. In addition to the standard color scheme (red, blue, yellow, and green), we conduct three additional experiments. First, we randomly swap the colors used in the standard color scheme. Second, we choose different shades or intensities of the same color system for each label class. Third, we only draw the bounding box lines with colors.
The results in Table 6 reveal three rules followed by VANCL, which are consistent with existing findings in cognitive science (Gegenfurtner, 2003; Elliot and Maier, 2014): 1) different shades or intensities of the same color system do not significantly affect the results; 2) swapping the colors within the current color scheme does not significantly affect the results; and 3) VANCL is effective when using colors with strong contrasts, while being sensitive to grayscale.

Table 7: The impact of reducing the size of the training subset on the SEABILL dataset.

Low-resource scenario
To verify the effectiveness of VANCL in low-resource scenarios, we investigate whether VANCL improves the SER task when training the model with different sizes of data. We choose LayoutLM and LayoutLMv2 as the test backbones and compare the results of VANCL with the corresponding baseline while varying the training data size. Specifically, we randomly draw a percentage p of the training data from the SEABILL dataset, where p is chosen from the set {5%, 12.5%, 25%, 50%}. Results in Table 7 reveal two key observations. First, VANCL consistently outperforms the LayoutLM baselines for different sizes of training data. Second, the performance gap between VANCL and the baseline is much larger in low-resource scenarios, indicating that VANCL can boost model training in such settings.
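The sampling protocol above can be sketched as follows (illustrative; the document IDs are stand-ins, and the seed handling is our assumption):

```python
import random

def draw_subset(train_docs, p, seed=0):
    """Randomly draw a p-fraction of the training documents."""
    rng = random.Random(seed)          # fixed seed for a reproducible draw
    k = int(len(train_docs) * p)
    return rng.sample(train_docs, k)

docs = list(range(3562))               # SEABILL has 3,562 training documents
subsets = {p: draw_subset(docs, p) for p in (0.05, 0.125, 0.25, 0.5)}
```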

Non-pretrained testing
To verify whether VANCL's improvement is simply due to pretrained visual encoders containing real-world color priors, we deliberately test visual encoders that are not pretrained, randomly initializing their parameters when they are introduced into the model. In this way, the visual encoder needs to learn to extract visual features from scratch. Considering that deeper networks are usually more difficult to train in a short time, we choose a smaller ResNet-18 and an extremely simple two-layer CNN (Lecun et al., 1998) as the outer visual encoders for these experiments. The results in Table 8 show that even simple visual encoder networks, such as CNN-2 and ResNet-18, can still surpass the standard baseline models, indicating that the original backbone models learn better visual features through visually-asymmetric consistency learning (VANCL).
Related Work

Visually-rich document understanding
To achieve the goal of understanding visually-rich documents, a natural attempt is to enrich state-of-the-art pretrained language representations by making full use of the information from multiple modalities, including text, layout, and visual features.
Researchers have proposed various paradigms, such as text-based, grid-based, graph-based, and transformer-based methods. Text-based methods, e.g., XLM-RoBERTa (Conneau et al., 2020) and InfoXLM (Chi et al., 2021), only consider the textual modality and rely on the representation ability of large-scale pretrained language models. Grid-based methods, e.g., Chargrid (Katti et al., 2018), BERTgrid (Denk and Reisswig, 2019), and ViBERTgrid (Lin et al., 2021), represent documents using a 2D feature map, allowing for the use of image segmentation and/or object detection models from computer vision (CV). Graph-based methods (Liu et al., 2019a; Tang et al., 2021) treat text segments in the document as nodes and model the associations among text segments using graph neural networks.

Consistency learning
In recent years, there have been significant advancements in consistency learning, which aims to reduce variances across different model predictions.
The core idea of consistency learning is to construct cross-view constraints that enhance consistency across different model predictions. Previous research on consistency learning has focused on adversarial or semi-supervised learning, which provides supervised signals that help models learn from unlabeled data (Miyato et al., 2019; Xie et al., 2020; Lowell et al., 2021a). There is also interest in incorporating consistency mechanisms into supervised learning (Chen et al., 2021a; Lowell et al., 2021b). Batra and Parikh (2017) propose cooperative learning, allowing multiple networks to learn the same semantic concept in different environments using different data sources, which is resistant to semantic drift. Zhang et al. (2018) propose a deep mutual learning strategy inspired by cooperative learning, which enables multiple networks to learn from each other. Previous works enable cross-modal learning either by using classical knowledge distillation methods (Hinton et al., 2015) or modal-deficient generative adversarial networks (Ren and Xiong, 2021). Hinton et al. (2015)'s method starts with a large, powerful pretrained teacher network and performs one-way knowledge transfer to a simple student, while Ren and Xiong (2021)'s method encourages the partial-multimodal flow to be as strong as the full-multimodal flow. In contrast, we explore the potential of cross-modal learning and information transfer, enhancing the model's information extraction ability through visual clues. This allows the model to learn prior knowledge of entity types, acquired before encountering the specific task or data. In terms of model structure, our architecture is closer to mutual learning (Zhang et al., 2018) or cooperative learning (Batra and Parikh, 2017).

Conclusion
We present VANCL, a novel consistency learning approach to enhance layout-aware pretrained multimodal models for visually-rich form-like document information extraction. Experimental results show that VANCL successfully learns prior knowledge during the training phase, surpassing existing state-of-the-art models on benchmark datasets of varying sizes and a large-scale real-world form document dataset. VANCL exhibits superior performance against strong baselines with no increase in model size. We also provide recommendations for selecting color schemes to optimize performance.

Limitations
The limitations of our work in this paper are the following points: 1. For the sake of simplicity, we only experiment with the LayoutLM-BASE, LayoutLMv2-

It can be seen that sharing the parameters of the visual encoder is better than not sharing the weights in most cases (see Table 9), and sharing parameters also greatly reduces the overall number of parameters of the model (see Table 2). In addition, we compare the results with and without the colorful prompt paints under both weight-sharing settings. Even when model weights are not shared, we observe that colored visual clues still significantly improve the performance of the model.

Figure 1 :
Figure 1: Motivation of this work. Semantic entities of the same type often have similar visual and layout properties, such as the same or similar font, background, color, and the location and size of bounding boxes, providing important indications for recognizing entities and their types. Despite the importance of these properties, existing LMMs for information extraction in VFDs often rely on a limited visual encoder that cannot fully capture such fine-grained features. Therefore, this work focuses on incorporating these visual priors, using colors, into the task of IE for VFDs.

Figure 2 :
Figure 2: The overall illustration of the VANCL framework.It encourages knowledge transfer from the extra vision-enhanced flow to the standard flow through consistency losses.
Figure 3 illustrates the different structures of the three methods.

Figure 4 :
Figure 4: The t-SNE visualization of token-level visual encodings of a particular type output by the standard (red) and vision-enhanced (grayish purple) flows.

Table 1 :
Statistics of the used datasets, including the numbers of documents (Doc), bounding boxes (BD), and tokens (Token).

Table 3 :
Ablation experiment examining the impact of the consistency loss on different backbones. -CL means no consistency loss is used.

Table 4 :
Effect of different consistency losses using JS and KL divergences on F1 scores. Bold indicates the best and underline indicates the second best.

Table 6 :
Results of using different color schemes on the FUNSD dataset. SUPP. denotes the number of supports for each entity type in the test set.

Table 9 :
Effect of sharing the weights of the network parameters between the inner and outer visual encoders in LAYOUTLM (w/ img) on model F1 score. Sharing the weights of the model brings significantly better results than not sharing the weights.

FUNSD This dataset contains 199 well-annotated scanned forms. Each semantic entity consists of a unique identifier id, a label ({Question, Answer, Header, or Other}), a bounding box, a list of links to other entities, and a list of words.

SROIE The dataset contains 626 receipts for training and 347 receipts for testing. Each receipt consists of its scanned image and OCR transcription, organized as a list of text segments paired with bounding box position information. Each receipt is labeled with four types of entities, namely {Company, Date, Address, Total}.

SEABILL The dataset is a complex collection of documents derived from maritime business scenarios and consists of 3,562 training documents and 953 test documents. The data consists of PDF images and rule-based transcriptions of the documents with three labels {Question, Answer, Other}.

A.2 Effect of sharing weights
In this section, we design experiments to investigate the effect of sharing weights. As shown in Table 9, we conduct experiments on three datasets by controlling the variables of the inputs and model weight sharing.