DocumentNet: Bridging the Data Gap in Document Pre-Training

Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI. However, publicly available data have been scarce for these tasks due to strict privacy constraints and high annotation costs. To make things worse, the non-overlapping entity spaces from different datasets hinder the knowledge transfer between document types. In this paper, we propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models. The collected dataset, named DocumentNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. The current DocumentNet consists of 30M documents spanning nearly 400 document types organized in a four-level ontology. Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training for both classic and few-shot learning settings. With the recent emergence of large language models (LLMs), DocumentNet provides a large data source to extend their multi-modal capabilities for VDER.


Introduction
Document understanding is one of the most error-prone and tedious tasks many people have to handle every day. Advancements in machine learning techniques have made it possible to automate such tasks. In a typical Visually-rich Document Entity Retrieval (VDER) task, pieces of information are retrieved from the document based on a set of predefined entity types, known as the schema. For example, "amount", "date", and "item name" are major parts of an invoice schema.
The current setup of VDER tasks presents several unique challenges for acquiring sufficient training data. First, the availability of raw document images is greatly limited due to privacy constraints. Real-world documents, such as a driver's license or a bank statement, often contain personally identifiable information and are subject to access controls. Second, detailed annotation is costly and typically requires intensive training for experienced human annotators. E.g., it takes deep domain knowledge to correctly label different fields in complex tax forms. Finally, knowledge sharing between various types of documents is constrained by inconsistent label spaces and contextual logic. For example, the entity sets (i.e., schema) could be mutually exclusive, or the same entity type could take different semantic meanings in different contexts.
A number of models have been proposed for VDER tasks with varying success (Huang et al., 2022; Lee et al., 2022; Appalaraju et al., 2021; Gu et al., 2021). To tackle the aforementioned challenges, most prior works initialize from a language model followed by BERT-style (Devlin et al., 2019) pre-training on document datasets with additional layout and visual features. However, even the largest dataset currently in use, i.e., the IIT-CDIP dataset (Lewis et al., 2006), is of limited size and only reflects a subset of document types.

Related Work
Tab. 1 provides an overview of relevant document datasets, with more details in App. B.1.
Single-domain document datasets. Many small document datasets with entity-span annotations have been used for tasks such as entity extraction.
They contain fewer than 100k pages from a single domain. Newer datasets come with high-quality OCR annotation thanks to advances in relevant tools, while older ones, such as FUNSD (Jaume et al., 2019), often contain OCR errors. These datasets do not contain sufficient samples for the pre-training of a large model.

Large document datasets.
A few larger datasets contain over 100k pages from different domains. However, they usually do not contain OCR annotations or entity-level labels. IIT-CDIP (Lewis et al., 2006) has been the largest dataset commonly used for pre-training of document understanding models. Although these datasets are large, their image quality and annotation completeness are often unsatisfactory. To complement them, we collect high-quality document images from the Internet to build the DocumentNet datasets with rich OCR and entity annotations, and demonstrate their effectiveness in document model pre-training.
Ontology-based datasets. Large labeled datasets are usually collected following an ontology. ImageNet (Deng et al., 2009) for image recognition is built upon the synsets of WordNet (Miller, 1998).

DocumentNet Dataset
Blindly crawling the Web for images may seem easy, but it is not a practical solution since most images on the Web are not relevant to document types. We need a scalable pipeline to select only the images of concern. Broadly, this is achievable via a nearest-neighbor search of relevant keywords in a text-image joint embedding space. First, we design a set of query keywords, i.e., the document ontology, and encode them into the embedding space of general Web images. Further, a nearest-neighbor algorithm retrieves the top-K semantically closest images to each query keyword. Finally, a deduplication step consolidates all retrieved images across all query keywords. Fig. 1 illustrates several exemplar documents retrieved using our provided keywords.
Ontology creation. Each text string in the ontology list serves as a seed to retrieve the most relevant images from the general Web image pool. An ideal ontology list should therefore cover a broad spectrum of query keywords across and within the concerned downstream application domains. Although algorithmic or generative approaches may exist, in this paper we manually curated about 400 document-related query keywords that cover the domains of finance, business, personal affairs, legal affairs, tax, education, etc. The full ontology hierarchy and keyword list are provided in App. D.
Image retrieval from ontology. To retrieve only the most relevant document images out of the hundreds of billions of general Web images, we leverage a highly efficient nearest-neighbor pipeline, defining the distance metric as the dot product between the semantic feature vectors of the image and each of the target query keywords. Here we use Graph-RISE (Timofeev et al., 2020) for the semantic image embedding, and all query keywords are encoded into the same feature space as the images. Empirically, we pick the top 10k nearest neighbors for each query keyword. Note that the same image might be retrieved via multiple semantically similar keywords, so a de-duplication step is needed afterward. We summarize the main pipeline steps in Fig. 2. Fig. 3 shows statistics of the retrieved 30M document images with the mean and standard deviation histograms over each of the query keywords. The majority of the retrieved images have mean distance values greater than 0.8 and standard deviations of no more than 0.03, indicating high relevance to the document ontology.
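To make the retrieval step concrete, below is a minimal sketch of the keyword-to-image nearest-neighbor search and de-duplication. The embedding functions are placeholders: the paper relies on Graph-RISE embeddings over a proprietary Web image index, so `embed_text`, `embed_image`, and the brute-force search shown here are illustrative assumptions (a production pipeline would use an approximate nearest-neighbor index).

```python
import hashlib
import numpy as np

def retrieve_documents(keywords, image_paths, embed_text, embed_image, top_k=10_000):
    """Top-K dot-product retrieval per keyword, followed by de-duplication.
    `embed_text` and `embed_image` are assumed to map into the same d-dim space."""
    image_vecs = np.stack([embed_image(p) for p in image_paths])   # (N, d)
    selected = {}
    for kw in keywords:
        q = embed_text(kw)                                          # (d,)
        scores = image_vecs @ q                                     # dot-product similarity
        for i in np.argsort(-scores)[:top_k]:
            # De-duplicate images retrieved via multiple semantically similar keywords.
            digest = hashlib.md5(open(image_paths[i], "rb").read()).hexdigest()
            selected.setdefault(digest, (image_paths[i], kw, float(scores[i])))
    return list(selected.values())
```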
OCR and annotation. The retrieved images are fed into an OCR engine to generate a text sequence in reading order. We apply a text tagging model to weakly annotate the text segments of each sequence into 6 classes: email addresses, mailing addresses, prices, dates, phone numbers, and person names. Albeit noisy, these classification labels provide additional supervision for pre-training.
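The tagging model itself is proprietary; as a rough illustration only, a few of the six classes can be weakly approximated with regular expressions. The patterns below are assumptions, not the paper's tagger.

```python
import re

# Illustrative weak tagger covering three of the six classes.
PATTERNS = {
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "price":         re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?"),
    "phone_number":  re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def weak_tag(ocr_text: str):
    """Return (start, end, label) spans found in the reading-order OCR text."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(ocr_text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)
```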
Post-processing and open-source tools. We adopt heuristic-based filtering to improve sample quality. For example, we remove samples where the overall OCR result is poor due to blurry or noisy images. Some proprietary tools are used for scalable processing during the construction of DocumentNet, but open-source alternatives are readily available.
With all of the above steps, we have obtained a dataset of high-quality document images that are closely relevant to our query ontology. This dataset contains multiple modalities, including the image pixels, the OCR characters, the layout coordinates, and the segment tags.

UniFormer Model
To take advantage of all the modalities available in DocumentNet, we build a lightweight transformer model named UniFormer for document pre-training. UniFormer is built upon the BERT (Devlin et al., 2019) architecture, similar to LayoutLM (Xu et al., 2020) and LayoutLMv2 (Xu et al., 2021). Fig. 4 illustrates the pre-training pipeline. We highlight the new designs for multimodal pre-training here and defer more details to App. A.

Multimodal tokenization and embedding.
With a pre-defined text tokenizer, e.g., WordPiece (Wu et al., 2016), we first tokenize the OCR characters into a sequence of text tokens $c$. For each token $c_i$, we obtain its bounding box $b_i = (x_0, y_0, x_1, y_1)_i$ by taking the union of the bounding boxes of its characters. We enlarge the bounding box by a context ratio $r$ on each side and obtain the corresponding visual image crop $v_i$ for each token from the raw image. To model visual information, we add a crop embedding by linearly projecting the flattened pixels in the image crop, following ViT (Dosovitskiy et al., 2020).

Masked crop modeling. In addition to predicting the text token in the MMLM objective, a UniFormer parameterized by $\theta$ also predicts the visual modality by reconstructing the image crops for the masked tokens, in a way similar to MAE (He et al., 2022). It is formulated as a regression problem with a linear layer outputting flattened pixels, and the objective is $\mathcal{L}_{\mathrm{MCM}} = \sum_{i \in M} \lVert f_\theta(\hat{c}, \hat{v}, \rho)_i - v_i \rVert_2^2$, where $\hat{c}$ and $\hat{v}$ denote the masked tokens and crops according to mask $M$, $f_\theta(\cdot)_i$ is the flattened-pixel output of the crop head at position $i$, and $\rho$ is the position and layout embeddings.
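A minimal sketch of the crop extraction and crop-embedding step described above follows; the crop resolution (16x16), the context ratio value, and the use of OpenCV for resizing are illustrative assumptions rather than the paper's exact configuration.

```python
import cv2          # assumed resize backend; any image library works
import numpy as np

def token_bbox(char_boxes):
    """Union of a token's character boxes -> (x0, y0, x1, y1)."""
    boxes = np.asarray(char_boxes, dtype=float)
    return boxes[:, 0].min(), boxes[:, 1].min(), boxes[:, 2].max(), boxes[:, 3].max()

def token_crop(image, bbox, context_ratio=0.1):
    """Enlarge the token box by `context_ratio` on each side, then crop the pixels."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    dx, dy = context_ratio * (x1 - x0), context_ratio * (y1 - y0)
    x0, y0 = max(0, int(x0 - dx)), max(0, int(y0 - dy))
    x1, y1 = min(w, int(x1 + dx)), min(h, int(y1 + dy))
    return image[y0:y1, x0:x1]

def crop_embedding(crop, projection):
    """Linearly project the flattened, resized crop, as in a ViT patch embedding.
    `projection` has shape (crop_h * crop_w * 3, hidden_dim)."""
    resized = cv2.resize(crop, (16, 16))               # assumed crop resolution
    return resized.reshape(-1).astype(np.float32) @ projection
```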
Token tagging. With fully unmasked sequences, UniFormer is pre-trained to predict the token tags $t$ with a separate head. Since each token may have multiple tags, it is formulated as a multi-label classification problem with binary cross-entropy losses.

Experiments
We pre-train UniFormer on DocumentNet and evaluate on two settings: (1) the classic VDER setting with the full train and test splits; (2) the few-shot VDER setting where we have meta-train and meta-test task sets, with each task containing a set of samples that satisfies the N-way K-shot setting.

Pre-Training
We initialize our UniFormer with BERT weights using the uncased vocabulary. The models are pre-trained using the Adam optimizer (Kingma and Ba, 2014). We adopt a cosine learning rate schedule with linear warmup during the first 2% of steps and a peak learning rate of $10^{-4}$. We use 20% of the samples for the token tagging pre-training task. The models are trained for 500K steps with a batch size of 2048 on 128 TPUv3 devices.
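The schedule below is a small sketch matching the stated recipe (2% linear warmup to a peak of 1e-4 over 500K steps, then cosine decay); decaying to exactly zero is an assumption.

```python
import math

def learning_rate(step, total_steps=500_000, peak_lr=1e-4, warmup_frac=0.02):
    """Linear warmup for the first 2% of steps, then cosine decay (to zero, assumed)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```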

Classic VDER Setting
We evaluate the performance of pre-trained UniFormer models on three benchmarks: FUNSD, CORD, and RVL-CDIP.

Comparisons with state-of-the-art. We compare the performance on the three benchmarks with state-of-the-art approaches in Table 4. As shown, most prior methods use stronger language or image initialization than our lightweight UniFormer, but all of them are pre-trained only on datasets no larger than IIT-CDIP. Although UniFormer uses only 115M parameters and BERT initialization, it outperforms all baseline approaches after pre-training on our DocumentNet dataset, with FUNSD entity F1 84.18, CORD entity F1 96.45, and RVL-CDIP accuracy 95.34.

Few-shot VDER Setting
We evaluate the performance of pre-trained UniFormer models on N-way K-shot meta-learning settings with the CORD dataset.

Implementation details. In addition to the Simple prediction head used in the classic setting, we also adopt a two-level Hierarchical prediction head. At the first level, it performs a binary classification of the O-tag to identify background tokens. Non-background tokens are further classified by the second level. Hierarchical prediction helps reduce the label imbalance problem where the majority of tokens are labeled as background. After eliminating a few entities that do not appear frequently enough, we use 18 entities for meta-train and 5 entities for meta-test, for a total of 23 entities. We fine-tune for 15 steps with a constant learning rate of 0.02.
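A minimal PyTorch sketch of the two-level Hierarchical head described above; the hidden size, the number of entity classes, and the 0.5 background threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalHead(nn.Module):
    """Level 1 detects background (O) tokens; level 2 classifies the rest into entity types."""
    def __init__(self, hidden_dim=768, num_entity_classes=5):
        super().__init__()
        self.background = nn.Linear(hidden_dim, 1)                # O vs. non-O
        self.entity = nn.Linear(hidden_dim, num_entity_classes)   # entity types

    def forward(self, token_embeddings):
        bg_logit = self.background(token_embeddings).squeeze(-1)  # (B, T)
        entity_logits = self.entity(token_embeddings)             # (B, T, C)
        return bg_logit, entity_logits

    def predict(self, token_embeddings, threshold=0.5):
        bg_logit, entity_logits = self.forward(token_embeddings)
        is_background = torch.sigmoid(bg_logit) > threshold
        labels = entity_logits.argmax(-1) + 1      # shift entity ids so 0 denotes O
        labels[is_background] = 0
        return labels
```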
Results. As shown in Tab. 5, adding the DocumentNet data significantly boosts the performance of our models across all few-shot learning settings. In particular, the 30M DocumentNet-v2 variant yields a much larger improvement than the 9.9M DocumentNet-v1. The amount of data and the diversity of the collected document types play a significant role in the performance improvements. Performance improvements are consistent across all metrics, with recall improving more than precision.

Conclusions
In this paper, we proposed a method to use massive and noisy web data to benefit the training of VDER models. Our approach provides a large amount of document data at little cost compared to the usual data collection processes in the VDER domain. Our experiments demonstrated significantly boosted performance in both the classic and the few-shot learning settings. There are a number of areas that warrant extensions or future work. First, a systematic study of the exact keywords and collection strategies that optimize the model outcome is yet to be conducted; the methods proposed in this paper are merely a starting point in this direction. Second, architecture changes that specifically target the proposed massive and noisy data collection remain an open research question. One observation we made when examining the data is that many documents contain empty forms while others have filled-in content. Models that can explicitly take advantage of both formats should further boost performance.

Crop embedding. Compared with visual features extracted by a pretrained CNN (Xu et al., 2021) or manually defined patches (Huang et al., 2022), the token-aligned crop embedding has the following advantages:
• It obtains an aligned partition of the visual information with the text tokens, encouraging better cross-modal interaction.
• It eliminates the need for separate visual tokens as in (Xu et al., 2021; Huang et al., 2022), resulting in a shorter token sequence and better efficiency, as shown in Fig. 5.
• It provides a unified joint representation for text and visual modalities in document modeling with semantic-level granularity.

A.3 Pretraining
During pretraining, we adopt the following objectives on a UniFormer parameterized by $\theta$. For each objective, we use a separate head upon the last attention layer. Let $\rho$ denote the always-available input embeddings, including the 1D and 2D positions.

Multimodal Masked Language Modeling (MMLM)
We randomly select 15% (Devlin et al., 2019) of the tokens, denoted as $M$, to mask and predict the language modality. In the masked language input $\hat{c}$, 80% of the masked tokens are replaced with a special [MASK] token, another 10% are replaced with a random token, and the remaining 10% are kept as is. In the masked crop input $\hat{v}$, crops for all masked tokens are replaced with an empty image. The language prediction is formulated as a multi-class classification problem with the cross-entropy loss $\mathcal{L}_{\mathrm{MMLM}} = -\sum_{i \in M} \log p_\theta(c_i \mid \hat{c}, \hat{v}, \rho)$.

Masked Crop Modeling (MCM)

We also predict the visual modality by reconstructing the image crops for the masked tokens in MMLM, in a way similar to MAE (He et al., 2022). It is formulated as a regression problem with a linear layer over flattened pixels. The MCM loss is defined as $\mathcal{L}_{\mathrm{MCM}} = \sum_{i \in M} \lVert f_\theta(\hat{c}, \hat{v}, \rho)_i - v_i \rVert_2^2$, where $f_\theta(\cdot)_i$ denotes the flattened-pixel output of the crop head at position $i$.

Token Tagging (TT)

We add an extra pretraining task by predicting the tags $t$ for each token in an unmasked sequence. The tags are extracted from an external text tagger as described in Sec. 3.
Since each token may have multiple tags, it is formulated as a multi-label classification problem with the binary cross-entropy loss $\mathcal{L}_{\mathrm{TT}} = -\sum_{i} \sum_{k} \big[ t_{ik} \log \hat{t}_{ik} + (1 - t_{ik}) \log (1 - \hat{t}_{ik}) \big]$, where $\hat{t}_{ik}$ is the predicted probability of tag $k$ for token $i$.

Pretraining Loss

The overall pretraining objective is given as $\mathcal{L} = \mathcal{L}_{\mathrm{MMLM}} + \alpha\, \mathcal{L}_{\mathrm{MCM}} + \beta\, \mathcal{L}_{\mathrm{TT}}$, where $\alpha$ and $\beta$ are the corresponding loss weights.
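As a concrete, hedged sketch of the masking and loss computation above: the tensor layout and head outputs are assumptions, `MASK_TOKEN_ID` is the BERT-uncased value, and the unit loss weights are placeholders.

```python
import random
import torch
import torch.nn.functional as F

MASK_TOKEN_ID = 103  # [MASK] id in the BERT uncased vocabulary

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    """BERT-style corruption: select 15% of tokens; of those, 80% become [MASK],
    10% a random token, 10% stay unchanged. Crops at the selected positions would
    be replaced with an empty image in the same pass."""
    corrupted, selected = list(token_ids), []
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            selected.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK_TOKEN_ID
            elif r < 0.9:
                corrupted[i] = random.randrange(vocab_size)
            # else: keep the original token
    return corrupted, selected

def pretraining_loss(outputs, targets, masked, alpha=1.0, beta=1.0):
    """Combine MMLM, MCM, and TT.
    outputs: 'token_logits' (B,T,V), 'crop_pred' (B,T,P), 'tag_logits' (B,T,K)
    targets: 'token_ids' (B,T), 'crop_pixels' (B,T,P), 'tags' (B,T,K)
    masked:  boolean tensor (B,T) marking the MMLM-selected positions."""
    mmlm = F.cross_entropy(outputs["token_logits"][masked], targets["token_ids"][masked])
    mcm = F.mse_loss(outputs["crop_pred"][masked], targets["crop_pixels"][masked])
    tt = F.binary_cross_entropy_with_logits(outputs["tag_logits"], targets["tags"].float())
    return mmlm + alpha * mcm + beta * tt
```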

A.4 Finetuning
Entity Extraction
The prediction of BIO tags is modeled as a multi-class classification problem with a cross-entropy objective over all tokens.

Document Classification
We use the embedding of the starting [CLS] token for document classification. The logits are predicted with an MLP head on top of the [CLS] embedding. Let $l$ be the correct class; the objective is the cross-entropy loss $\mathcal{L}_{\mathrm{cls}} = -\log p_\theta(l)$, where $p_\theta(l)$ is the predicted probability of class $l$.

B Additional Related Works

B.1 Document Datasets

Single-domain document datasets. Datasets such as the one of Huang et al. (2019), Kleister (Stanisławek et al., 2021) NDA and Charity, DeepForm (Borchmann et al., 2021), and VRDU (Wang et al., 2022b) Ad-buy and Registration have been introduced at the scale of a few thousand documents. Among them, DocVQA (Mathew et al., 2021) contains 12.8K documents with question-answer annotations.
Larger document datasets. IIT-CDIP (Lewis et al., 2006) consists of 11M unlabeled documents with more than 39M pages. PDF files from Common Crawl (CC-PDF) and the UCSF Industry Documents Library (UCSF-IDL) have also been used for pretraining (Powalski et al., 2021), with a total of fewer than 1M documents. RVL-CDIP, a subset of IIT-CDIP, contains 400K documents categorized into 16 classes for the document classification task. PubLayNet (Zhong et al., 2019) is at a similar scale but targets the layout detection task with bounding box and segmentation annotations.

B.2 Document Understanding Models
Document understanding models have emerged since LayoutLM (Xu et al., 2020), which extends BERT (Devlin et al., 2019) with spatial and visual information. Various models use different initialization weights, model scales, and pretraining data configurations. Table 4 provides a detailed comparison of existing models.
Text Modality. Document models are usually built upon a pretrained language model. As shown by LayoutLM (Xu et al., 2020), language initialization significantly impacts the final model performance. Many works have been built upon the standard BERT language model, such as LayoutLM (Xu et al., 2020), BROS (Hong et al., 2022), SelfDoc (Li et al., 2021), and UDoc (Gu et al., 2021). LayoutLMv2 (Xu et al., 2021) is initialized from UniLM (Dong et al., 2019). TILT (Powalski et al., 2021) extends T5 (Raffel et al., 2020) for document analysis. DocFormer (Appalaraju et al., 2021) directly initializes from a pretrained LayoutLM. The recent LiLT (Wang et al., 2022a) and LayoutLMv3 (Huang et al., 2022) models are initialized from RoBERTa (Liu et al., 2019) to provide a stronger language prior. In our experiments, we adopt the vanilla BERT-base model for fair comparison.
C.2 Few-shot VDER Setting

N-way K-shot meta-learning formulation. In our setting, we define an N-way K-shot problem to be one such that there are N novel classes, each of which appears no more than K times in the training set.
We then divide a dataset into several sub-groups, each satisfying the N-way K-shot definition. One unique characteristic of VDER datasets is that documents usually contain multiple entities, and many entities occur more than once in a single document; we therefore treat the requirement on the number of occurrences K as a soft one so that it is realistic to generate such a dataset split. The few-shot learning problem naturally fits into a meta-learning scenario: meta-train and meta-test each contain a set of tasks satisfying the N-way K-shot setting.
We sample datasets to achieve N-way K-shot settings, which means that our training data contains N entities, each with at least K occurrences. The count of classes in testing is fixed at 5. For hyper-parameters, we follow most of the settings from the classic VDER experiments. We fine-tune with a learning rate of 0.02.
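A rough sketch of how an N-way K-shot episode could be assembled under the soft-K constraint follows; the document/entity data layout and the greedy strategy are assumptions, not the paper's exact splitting procedure.

```python
import random
from collections import defaultdict

def sample_episode(documents, entity_types, n_way=5, k_shot=5):
    """Greedy episode sampler: pick N entity types, then add documents until each
    type has roughly K occurrences. Because a document carries many entities,
    often repeated, K acts as a soft lower bound."""
    ways = random.sample(entity_types, n_way)
    counts = defaultdict(int)
    support = []
    for doc in random.sample(documents, len(documents)):   # shuffled copy
        doc_types = {e["type"] for e in doc["entities"]}
        if not doc_types & set(ways):
            continue
        support.append(doc)
        for e in doc["entities"]:
            if e["type"] in ways:
                counts[e["type"]] += 1
        if all(counts[w] >= k_shot for w in ways):
            break
    return ways, support
```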

Figure 1 :
Figure 1: Exemplar documents of each of the four top-level hierarchies. Images are downloaded via keyword searching using a commercial search engine. All images are for demonstration purposes only and do not contain real transactions or personal information.

Figure 3 :
Figure 3: Mean and standard deviation of the dot-product distance between the retrieved 30M document images and each query keyword. A distance of 1.0 indicates the closest semantic relevance.

Figure 4 :
Figure 4: UniFormer pre-training pipeline. The multimodal tokenization process (left) outputs tokens with aligned image crops. The UniFormer model (right) learns a unified token representation with three objectives (top).

Figure 5 :
Figure 5: Unaligned (left) vs. aligned (right) visual features. The unaligned visual features result in a longer sequence but are usually discarded in downstream tasks. T: Text, I: Image.

Fig. 6
Fig. 6 illustrates the pipeline for the finetuning of UniFormer. During finetuning, no tokens are masked. In this paper, we adopt the following two tasks in finetuning.

Figure 7 :
Figure 7: Visualization of annotation (left) and prediction examples (middle and right) from the FUNSD validation set. Zoom in for details.

Fig. 8
Fig. 8 illustrates the document ontology tree stub used for the construction of DocumentNet. Below we list all of the search keywords organized into four groups.

D.1 Financial Documents

Table 1 :
Comparison between the proposed DocumentNet dataset and existing document understanding datasets.
Datasets from other areas that are also built with an ontology are listed in gray. Annotation includes sample type (T), bounding box (B), entity (E), and question (Q), where the value refers to the number of classes.
Table 2 lists the pre-training objectives and corresponding target modalities.

Table 3 :
Ablation studies on three document understanding benchmarks regarding pretraining datasets, pretraining objectives, and model architectures. Input modalities include text (T), layout (L), and crop (C).
Implementation details. For entity extraction on FUNSD and CORD, we add a Simple multi-class classification head on top of all text tokens to perform BIO tagging. We fine-tune with a peak learning rate of $5 \times 10^{-5}$, following a schedule of linear warm-up in the first 10% of steps and then linear decay. Dropout with 0.1 probability is applied in the head layers. UniFormer is fine-tuned for 1000 steps with a batch size of 32 on FUNSD and 256 on CORD. For document classification on RVL-CDIP, we add a multi-class classification head on top of the [CLS] token. We fine-tune with a constant learning rate of $10^{-5}$ for 15000 steps with a batch size of 2048.
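For reference, BIO tags can be decoded into entity spans with the standard helper below; it is illustrative and not the paper's evaluation code, and assumes whitespace-joined word tokens.

```python
def decode_bio(tokens, tags):
    """Convert per-token BIO tags (e.g. 'B-date', 'I-date', 'O') into entity spans."""
    entities, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "start": i, "end": i + 1}
        elif tag.startswith("I-") and current and tag[2:] == current["type"]:
            current["end"] = i + 1
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(e["type"], " ".join(tokens[e["start"]:e["end"]])) for e in entities]
```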

Table 4 :
Comparison with state-of-the-art document pretraining approaches on three document understanding benchmarks.
Detailed task setups are introduced in App. C.2. * denotes a variant that does not use its proprietary tokenizer in pre-training.

Table 5 :
Performance comparisons on the few-shot VDER settings with the CORD dataset.
Error analysis. Table 6 lists the detailed metrics on the FUNSD entity extraction task. Among the three labeled entity types, header has the poorest performance and the lowest number of examples. The other two types have much better performance, with F1 86.59 for question and F1 84.91 for answer. Fig. 7 visualizes a few examples with annotations and predictions from our UniFormer. As seen in the annotations, the reading order is often unusual and does not follow human conventions. However, the 2D positional embedding and spatial-aware attention handle such cases correctly regardless. In the prediction samples, we observe that the predictions for question and answer fields are mostly correct, while a few errors are made for header due to ambiguity.