A Critical Analysis of Document Out-of-Distribution Detection



Introduction
The recent success of large-scale pre-training has propelled the widespread deployment of deep learning models in the document domain, where model predictions are used to help humans make decisions in various applications such as tax form processing and medical report analysis. However, models are typically pre-trained on data collected from the web but deployed in an environment with distributional shifts (Cui et al., 2021). For instance, the outbreak of COVID-19 has led to continually changing data distributions in machine-assisted medical document analysis systems (Velavan and Meyer, 2020). This motivates the need for document understanding models that are reliable against out-of-distribution (OOD) inputs.

[Figure 1, right: During inference time, an OOD score can be derived based on logits g(x) or feature embeddings z := h(x). A document input x is identified as OOD if its OOD score is below some threshold γ.]
The goal of OOD detection is to categorize in-distribution (ID) samples into one of the known categories and to detect inputs that do not belong to any known class at test time (Bendale and Boult, 2016). A plethora of OOD detection methods have been proposed for single-modal (image or text) inputs (Ge et al., 2017; Nalisnick et al., 2019; Oza and Patel, 2019; Tack et al., 2020; Hsu et al., 2020; Arora et al., 2021; Zhou et al., 2021; Xiao et al., 2020; Xu et al., 2021a; Li et al., 2021b; Shen et al., 2021; Jin et al., 2022; Zhou et al., 2022; Ming et al., 2022b,c; Podolskiy et al., 2021; Ren et al., 2023). Recent works (Fort et al., 2021; Esmaeilpour et al., 2022; Ming et al., 2022a; Ming and Li, 2023; Bitterwolf et al., 2023) also demonstrate promising OOD detection performance based on large-scale models pre-trained on text-image pairs, as pre-training enables models to learn powerful and transferable feature representations (Radford et al., 2021). However, it remains largely unexplored whether existing findings in the OOD detection literature for images or texts naturally extend to the document domain.
Multiple unique challenges exist for document OOD detection. Unlike natural images, texts, or image-text pairs, no captions describe a document, and images in documents rarely contain natural objects. Moreover, the spatial relationship of text blocks further differentiates multimodal learning in documents from multimodal learning in the vision-language domain (Lu et al., 2019; Li et al., 2020). In addition, while recent pre-training methods have demonstrated remarkable performance in downstream document understanding tasks (Xu et al., 2020, 2021b; Li et al., 2021a; Gu et al., 2022; Hong et al., 2022; Huang et al., 2022; Li et al., 2022; Wang et al., 2022a), existing pre-training datasets for documents are limited and lack diversity. This is in sharp contrast to common pre-training datasets for natural images. It remains underexplored whether existing OOD detection methods are reliable in the document domain and how pre-training impacts OOD reliability.
In this work, we first present a comprehensive study to better understand OOD detection in the document domain through the following questions: (1) What is the role of document pre-training? How do pre-training datasets and tasks affect OOD detection performance? (2) Are existing OOD detection methods developed for natural images and texts transferable to documents? (3) How does modality (textual, visual, and especially spatial information) affect OOD performance? In particular, we find that spatial information is critical for improving OOD reliability. Moreover, we propose a new spatial-aware adapter, a small learned module that can be inserted within a pre-trained language model such as RoBERTa (Liu et al., 2019). Our module is computationally efficient and significantly improves both ID classification and OOD detection performance (Sec. 5.2). Our contributions are summarized as follows:
• We provide an extensive and in-depth study to investigate the impacts of pre-training, fine-tuning, model modality, and OOD scoring functions on a broad spectrum of document OOD detection tasks. Our codebase will be open-sourced to facilitate future research.
• We present unique insights on document OOD detection. For example, we observe that distance-based OOD scores are consistently advantageous over logit-based scores, which is underexplored in the recent OOD detection literature on vision-language pre-trained models.
• We further propose a spatial-aware adapter module for transformer-based language models, facilitating easy adaptation of pre-trained language models to the document domain. Extensive experiments confirm the effectiveness of our module across diverse types of OOD data.
2 Preliminaries and Related Works

Document Models and Pre-Training
Large-scale pre-trained models have gradually gained popularity in the document domain due to their success in producing generic representations from large-scale unlabeled corpora in vision and natural language processing (NLP) tasks (Devlin et al., 2018; Lu et al., 2019; Su et al., 2019; Schiappa et al., 2022). As documents contain both visual and textual information distributed spatially in semantic regions, document-specific models and pre-training objectives are often necessary, which are distinct from those in the vision or language domains. We summarize common model structures for document pre-training in Fig. 2a. Specifically, LayoutLM (Xu et al., 2020) takes a sequence of Optical Character Recognition (OCR) (Smith, 2007) words and word bounding boxes as inputs. It extends BERT to learn contextualized word representations for document images through multi-task learning. LayoutLMv2 (Xu et al., 2021b) improves on the prior work with new pre-training tasks to model the interaction among texts, layouts, and images. DocFormer (Appalaraju et al., 2021) adopts a CNN model to extract image grid features, fusing the spatial information as an inductive bias for the self-attention module. LayoutLMv3 (Huang et al., 2022) further enhances visual and spatial characteristics with masked image modeling and word-patch alignment tasks. Another line of work focuses on various granularities of documents, such as region-level text/image blocks. Examples of such models include SelfDoc (Li et al., 2021a), UDoc (Gu et al., 2021), and MGDoc (Wang et al., 2022b), which are pre-trained with a cross-modal encoder to capture the relationship between visual and textual features. These models incorporate spatial information by fusing position embeddings at the output layer of their encoders, instead of the input layer. Additionally, OCR-free models (Kim et al., 2022; Tang et al., 2023) tackle document understanding as a sequence generation problem, unifying multiple tasks through an image-to-sequence generation network.
While these pre-trained models demonstrate promising performance on downstream applications, their robustness to different types of OOD data, the influence of pre-training and fine-tuning, and the value of different modalities (e.g., spatial, textual, and visual) for document OOD detection remain largely unexplored.

Out-of-Distribution Detection
OOD detection has been extensively studied for open-world multi-class classification with natural image and text inputs, where the goal is to derive an OOD score that separates OOD from ID samples. A plethora of methods have been proposed for deep neural networks, where the OOD scoring function is typically derived based on logits (without softmax scaling) (Hendrycks et al., 2022), softmax outputs (Liang et al., 2018; Hsu et al., 2020; Huang and Li, 2021; Sun et al., 2021), gradients (Huang et al., 2021), and feature embeddings (Tack et al., 2020; Fort et al., 2021; Ming et al., 2023). Despite their impressive performance on natural images and texts, it is underexplored whether these results are transferable to the document domain. A recent work (Larson et al., 2022) studied OOD detection for documents but only explored a limited number of models and OOD detection methods. The impacts of pre-training, fine-tuning, and spatial information remain unknown. In this work, we aim to provide a comprehensive and finer-grained analysis to shed light on the key factors for OOD robustness in the document domain.
Notations. Following prior works on OOD detection with large-scale pre-trained models (Ming et al., 2022a; Ming and Li, 2023), the task of OOD detection is defined with respect to the downstream dataset, instead of the pre-training data, which is often hard to characterize. In document classification, we use $\mathcal{X}^{\text{in}}$ and $\mathcal{Y}^{\text{in}} = \{1, \ldots, K\}$ to denote the input and label space, respectively. Let $\mathcal{D}^{\text{in}} = \{(x_i^{\text{in}}, y_i^{\text{in}})\}_{i=1}^{N}$ be the ID dataset, where $x^{\text{in}} \in \mathcal{X}^{\text{in}}$ and $y^{\text{in}} \in \mathcal{Y}^{\text{in}}$. Let $\mathcal{D}^{\text{out}} = \{(x_i^{\text{out}}, y_i^{\text{out}})\}_{i=1}^{M}$ denote an OOD test set, where $y^{\text{out}} \in \mathcal{Y}^{\text{out}}$ and $\mathcal{Y}^{\text{out}} \cap \mathcal{Y}^{\text{in}} = \emptyset$. We express the neural network model $f := g \circ h$ as a composition of a feature extractor $h: \mathcal{X} \to \mathbb{R}^d$ and a classifier $g: \mathbb{R}^d \to \mathbb{R}^K$, which maps the feature embedding of an input to $K$ real-valued numbers known as logits. During inference time, given an input $x$, OOD detection can be formulated as:

$$G_\gamma(x) = \begin{cases} \text{ID} & \text{if } S(x) \ge \gamma, \\ \text{OOD} & \text{if } S(x) < \gamma, \end{cases}$$

where $S(\cdot)$ is a scoring function that measures OOD uncertainty. In practice, the threshold $\gamma$ is often chosen so that a high fraction of ID data (e.g., 95%) is above the threshold.
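This decision rule and the 95% threshold convention can be sketched in a few lines; the score arrays below are synthetic stand-ins for S(x) values, not outputs of any particular model:

```python
import numpy as np

def choose_threshold(id_scores, tpr=0.95):
    """Pick gamma so that a `tpr` fraction of ID scores lies above it."""
    return np.quantile(id_scores, 1.0 - tpr)

def detect_ood(scores, gamma):
    """Return True for inputs flagged as OOD (score below threshold)."""
    return scores < gamma

# Toy example: ID scores concentrate higher than OOD scores.
rng = np.random.default_rng(0)
id_scores = rng.normal(5.0, 1.0, size=1000)
ood_scores = rng.normal(0.0, 1.0, size=1000)

gamma = choose_threshold(id_scores)   # keeps ~95% of ID data above gamma
flags = detect_ood(ood_scores, gamma)
```

The threshold is chosen on ID data only, so detection performance on OOD data is fully determined by how well the scoring function separates the two populations.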
OOD detection scores. We focus on two major categories of computationally efficient OOD detection methods: logit-based methods derive OOD scores from the logit layer of the model, while distance-based methods directly leverage feature embeddings, as shown in Fig. 1. We describe a few popular methods for each category as follows.
• Logit-based: The Maximum Softmax Probability (MSP) score (Hendrycks and Gimpel, 2017), $S_{\text{MSP}}(x) = \max_k \operatorname{softmax}(g(x))_k$, naturally arises as a classic baseline, as models often output lower softmax probabilities for OOD data. The Energy score (Liu et al., 2020), $S_{\text{Energy}}(x) = T \log \sum_{k=1}^{K} e^{g_k(x)/T}$, utilizes the (negative) Helmholtz free energy of the data and theoretically aligns with the logarithm of the ID density. The simple MaxLogit score (Hendrycks et al., 2022), $S_{\text{MaxLogit}}(x) = \max_k g_k(x)$, has demonstrated promising performance on large-scale natural image datasets. We select the above scores due to their simplicity and computational efficiency. In addition, recent studies demonstrate that such simple scores are particularly effective with large-scale pre-trained models in vision (Fort et al., 2021) and vision-language domains (Ming et al., 2022a; Bitterwolf et al., 2023). We complement previous studies and investigate their effectiveness for documents.
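A minimal numpy sketch of the three logit-based scores; the temperature T = 1 for the energy score and the toy logit values are illustrative assumptions:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability; higher = more ID-like."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def energy_score(logits, T=1.0):
    """Negative free energy, T * logsumexp(g/T); aligns with log ID density."""
    z = logits / T
    m = z.max(axis=-1, keepdims=True)
    return T * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

def maxlogit_score(logits):
    """Maximum unnormalized logit."""
    return logits.max(axis=-1)

logits = np.array([[4.0, 1.0, 0.5],    # confident prediction (ID-like)
                   [1.2, 1.1, 1.0]])   # flat prediction (OOD-like)
```

All three scores follow the higher-is-more-ID convention, so they plug directly into the thresholded decision rule above.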
• Distance-based: Distance-based methods directly leverage feature embeddings z = h(x), based on the idea that OOD inputs are relatively far away from ID clusters in the feature space compared to ID inputs. Distance-based methods can be characterized as parametric and non-parametric. Parametric methods such as the Mahalanobis score (Lee et al., 2018; Sehwag et al., 2021) assume ID embeddings follow class-conditional Gaussian distributions and use the Mahalanobis distance as the distance metric. On the other hand, non-parametric methods such as KNN+ (Sun et al., 2022) use cosine similarity as the distance metric.
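A brute-force sketch of the non-parametric KNN score on L2-normalized embeddings (the synthetic Gaussian clusters and k = 10 are illustrative; KNN+ as published additionally relies on contrastively trained features and an approximate nearest-neighbor index, both omitted here):

```python
import numpy as np

def knn_score(train_feats, test_feats, k=10):
    """Negative cosine distance to the k-th nearest ID training embedding."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    tr, te = normalize(train_feats), normalize(test_feats)
    # Cosine distance = 1 - cosine similarity on the unit sphere.
    dists = 1.0 - te @ tr.T                   # (n_test, n_train)
    kth = np.sort(dists, axis=-1)[:, k - 1]   # distance to k-th neighbor
    return -kth                               # higher = more ID-like

# Toy clusters: ID test points sit near the ID training cluster.
rng = np.random.default_rng(0)
id_train = rng.normal(loc=1.0, size=(500, 16))
id_test = rng.normal(loc=1.0, size=(50, 16))
ood_test = rng.normal(loc=-1.0, size=(50, 16))  # shifted cluster

id_scores = knn_score(id_train, id_test)
ood_scores = knn_score(id_train, ood_test)
```

Negating the k-th neighbor distance keeps the higher-is-more-ID convention shared with the logit-based scores.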

Evaluation metrics. To evaluate OOD detection performance, we adopt the following commonly used metrics: the Area Under the Receiver Operating Characteristic curve (AUROC), the False Positive Rate at 95% recall (FPR95), and the multi-class classification accuracy (ID Acc).
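Both detection metrics can be computed directly from ID and OOD score arrays (using the higher-score-is-more-ID convention, with ID samples as the positives); a numpy-only sketch:

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Rank-based AUROC: P(random ID score > random OOD score)."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = scores.argsort().argsort() + 1  # ranks starting at 1
    n_id, n_ood = len(id_scores), len(ood_scores)
    # Mann-Whitney U statistic from the rank sum of ID samples.
    u = ranks[:n_id].sum() - n_id * (n_id + 1) / 2
    return u / (n_id * n_ood)

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples scoring above the 95%-ID-recall threshold."""
    thresh = np.quantile(id_scores, 0.05)
    return (ood_scores >= thresh).mean()
```

AUROC is threshold-free, while FPR95 reports the false positive rate at the same operating point (95% ID recall) used to choose the deployment threshold.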
3 Experimental Setup

Models. Fig. 2a summarizes common structures for document pre-training and classification models. While documents typically come in the form of images (Harley et al., 2015), an OCR system can be used to extract words and their coordinates from the input image. Therefore, models can use single-modal or multi-modal information. We categorize these models according to their input modalities into the following groups: (1) models using only visual features, (2) models using solely textual features, (3) models incorporating both visual and textual features, and (4) models integrating additional spatial (especially layout) information. Further details can be found in Appendix A.
• Vision-only: Document classification can be viewed as a standard image classification problem. We consider ResNet-50 (He et al., 2016) and ViT (Fort et al., 2021) as exemplar document image classification models. We adopt two common pre-training settings: (1) only pre-trained on ImageNet (Deng et al., 2009), and (2) further pre-trained on IIT-CDIP (Lewis et al., 2006) with masked image modeling (MIM). After pre-training, we append a classifier for fine-tuning.
• Text-only: Alternatively, we can view document classification as text classification, since documents often contain text blocks. To this end, we use RoBERTa (Liu et al., 2019) and Longformer (Beltagy et al., 2020) as the backbones. RoBERTa can handle up to 512 input tokens, while Longformer can handle up to 4,096 input tokens. We pre-train the language models with masked language modeling (MLM) on the text corpus extracted from IIT-CDIP.
• Text+Layout: Layout information plays a crucial role in the document domain, as shown in Fig. 3. To investigate the effect of layout information, we adopt LayoutLM as the backbone. We will show that spatial-aware models demonstrate promising OOD detection performance. However, such specialized models can be computationally expensive. Therefore, we propose a new spatial-aware adapter, a small learned module that can be inserted within a pre-trained language model such as RoBERTa and transforms it into a spatial-aware model, which is computationally efficient and competitive for both ID classification and OOD detection (Sec. 5.2).
• Vision+Text+Layout: For comprehensiveness, we consider LayoutLMv3 and UDoc, which are large and computationally intensive. Both models are pre-trained on the full IIT-CDIP for fairness. These models utilize different input granularities and modalities, including textual, visual, and spatial information for document tasks.
Note that the downstream dataset used in this paper, RVL-CDIP (Harley et al., 2015), is a subset of IIT-CDIP. Hence, unless otherwise specified, the IIT-CDIP pre-training data used in this paper excludes RVL-CDIP.
Constructing ID and OOD datasets. We construct ID datasets from RVL-CDIP (Harley et al., 2015), where 12 out of 16 classes are selected as ID classes. Dataset details are in Appendix A. We consider two OOD scenarios, in-domain and out-domain, based on content (e.g., words, background) and layout characteristics.
• In-domain OOD: To determine the OOD categories, we analyzed the performance of recent document classification models on the RVL-CDIP test set. Fig. 2b shows the per-category test accuracy of various models. Naturally, for the classes the models perform poorly on, we may expect the models to detect such inputs as OOD instead of assigning a specific ID class with low confidence. We observe that four categories (letter, form, scientific report, and presentation) result in the worst performance across most of the models with different modalities. We use these as OOD categories and construct the OOD datasets accordingly; the ID dataset is constructed from the remaining 12 categories. We refer to these as in-domain OOD datasets, as they are also sourced from RVL-CDIP.
• Out-domain OOD: In the open-world setting, test inputs can have significantly different color schemes and layouts compared to ID samples.
To mimic such scenarios, we use two public datasets as out-domain OOD test sets: the NJU-Fudan Paper-Poster Dataset (Qiang et al., 2019) and CORD (Park et al., 2019). The NJU-Fudan Paper-Poster Dataset contains scientific posters in digital PDF format. CORD is a receipt understanding dataset with significantly different inputs compared to RVL-CDIP. As shown in Fig. 3, receipt images can be challenging and require models to handle not only textual but also visual and spatial information.
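The in-domain split described above can be reproduced mechanically from per-category accuracies; a sketch using the 16 RVL-CDIP category names, where the accuracy values are illustrative placeholders rather than the paper's measured numbers:

```python
import numpy as np

# The 16 RVL-CDIP categories; accuracies below are illustrative only.
categories = ["letter", "form", "email", "handwritten", "advertisement",
              "scientific report", "scientific publication", "specification",
              "file folder", "news article", "budget", "invoice",
              "presentation", "questionnaire", "resume", "memo"]
acc = np.array([0.78, 0.75, 0.96, 0.92, 0.94, 0.77, 0.93, 0.90,
                0.91, 0.95, 0.89, 0.93, 0.80, 0.88, 0.97, 0.94])

n_ood = 4
ood_idx = np.argsort(acc)[:n_ood]  # worst-performing classes become OOD
ood_categories = sorted(categories[i] for i in ood_idx)
id_categories = sorted(set(categories) - set(ood_categories))
```

With these placeholder accuracies, the worst four classes match the paper's chosen in-domain OOD categories, and the remaining 12 form the ID label set.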
We further support our domain selection using OTDD (Alvarez-Melis and Fusi, 2020), a flexible geometric method for comparing probability distributions, which enables us to compare any two datasets regardless of their label sets. We observe a clear gap between in-domain and out-domain data, which aligns with our data selection. Further details can be found in Appendix A.1.

The Impact of Pre-Training on Document OOD Detection

We observe that: (1) For out-domain OOD data (Fig. 4a, right), increasing the amount of pre-training data can significantly improve the zero-shot OOD detection performance (without fine-tuning) for models across different modalities. Our hypothesis is that pre-training with diverse data is beneficial for coarse-grained OOD detection, such as inputs from different domains (e.g., color schemes).
(2) For in-domain OOD inputs, even increasing the amount of pre-training data by over 40% provides negligible improvements (Fig. 4a, left). This suggests the necessity of fine-tuning for improving in-domain OOD detection performance (Fig. 6).
We further explore a more restricted setting for zero-shot OOD detection, where potential OOD categories are removed from the pre-training dataset IIT-CDIP. First, we use LayoutLM fine-tuned on RVL-CDIP to predict labels for all documents in IIT-CDIP. Fig. 4b summarizes the distribution of the predicted classes on IIT-CDIP. Next, we remove the "OOD" categories from IIT-CDIP and pre-train two models (RoBERTa and LayoutLM) with 10%, 20%, 40%, and 100% of randomly sampled data from the filtered IIT-CDIP (dubbed IIT-CDIP−), respectively. The zero-shot OOD performance for in-domain and out-domain OOD is shown in Fig. 4c. For RoBERTa, we observe similar trends as in Fig. 4a, where increasing the amount of pre-training data improves zero-shot OOD detection performance for out-domain data. The zero-shot performance of LayoutLM likewise benefits from a larger pre-training dataset. In particular, given the same amount of pre-training data, LayoutLM consistently outperforms RoBERTa for both in-domain and out-domain OOD detection, which suggests that spatial information can be essential for boosting OOD reliability in the document domain. Motivated by the above observations, we dive deeper and analyze spatial-aware models next.

While pre-trained models exhibit the capability to differentiate data from various domains as a result of being trained on a diverse range of data, we observe that achieving more precise separation for in-domain OOD inputs remains difficult. Given this observation, we further analyze the impacts of fine-tuning for OOD detection with fixed pre-training datasets in the next section. By combining pre-trained models with a simple classifier and fine-tuning on RVL-CDIP (ID), we find that fine-tuning is advantageous in enhancing the OOD detection performance for both types of OOD samples.

The Impact of Fine-Tuning on Document OOD Detection
Recent document models are often pre-trained on a large-scale dataset and adapted to the target task via fine-tuning. To better understand the role of fine-tuning, we explore the following questions: 1) How does fine-tuning impact OOD reliability for in-domain and out-domain OOD inputs? 2) How does model modality impact the performance?
We consider a wide range of models pre-trained on pure text/image data (e.g., ImageNet and Wikipedia), described in Appendix A.3. During fine-tuning, we combine pre-trained models with a simple classifier and fine-tune on RVL-CDIP (ID). For models before and after fine-tuning, we extract the final feature embeddings and use the distance-based method KNN+ (Sun et al., 2022) for OOD detection. The results are shown in Fig. 6. We observe the following trends. First, fine-tuning largely improves OOD detection performance for both in-domain and out-domain OOD data. The same trend holds broadly across models with different modalities. Second, the improvement from fine-tuning is less significant for out-domain OOD data. For example, on Receipt (out-domain OOD), the AUROC for the pre-trained ViT model is 97.13, and fine-tuning only improves it by 0.79%. This suggests that pre-trained models do have the potential to separate data from different domains due to the diversity of data used for pre-training, while it remains hard for pre-trained models to perform finer-grained separation for in-domain OOD inputs. Therefore, fine-tuning is beneficial for improving OOD detection performance for both types of OOD samples. To further validate our conclusion, we consider two additional in-domain OOD settings for our analysis: (1) selecting the classes the model performs well on as in-domain OOD categories; (2) randomly selecting classes as OOD categories (Appendix A.2). We find that fine-tuning improves OOD detection for both settings, further verifying our observations.

Next, we take a closer look at the impact of model modality on out-domain OOD detection. As shown in Fig. 6 (mid and right), both vision-based and text-based models demonstrate strong reliability against scientific posters (OOD). However, vision-based models display stronger performance than text-based models for receipts (OOD). This can be explained by the fact that ViT was first pre-trained on ImageNet, while scientific posters and receipts contain diverse visual information such as colors and edges for vision models to utilize (see Fig. 3). On the other hand, although fine-tuning text-based models largely improves the detection performance compared to pre-trained counterparts, utilizing only textual information can be inherently limited for out-domain OOD detection.

The Importance of Spatial-Awareness
In previous sections, we mainly focused on mainstream text-based and vision-based models for in- and out-domain OOD detection. Next, we consider models tailored to document processing, which we refer to as spatial-aware models, such as LayoutLMv3 and UDoc. Given fine-tuned models, we compare the performance of logit-based and distance-based OOD scores.

Analysis of Spatial-Aware Models
We summarize key comparisons in Fig. 5, where we use MSP and Energy as exemplar logit-based scores and KNN+ as the distance-based score. Full results are in Appendix C. We can see that the simple KNN-based score (KNN+) consistently outperforms logit-based scores for both in-domain and out-domain OOD data across different models with different modalities. This is in contrast with recent works that investigate large-scale pre-trained models in the vision-language domain, where logit-based scores demonstrate strong OOD detection performance (Fort et al., 2021). As documents are distinct from natural image-text pairs, observations in the vision-language domain do not seamlessly translate to the document domain. Moreover, spatial-aware models demonstrate stronger OOD detection performance for both in-domain and out-domain OOD. For example, with the best scoring function (KNN+), LayoutLMv3 improves the average AUROC by 7.09% for out-domain OOD and 7.54% for in-domain OOD data compared to RoBERTa. This further highlights the value of spatial information for improving OOD robustness for documents.

Despite the impressive improvements brought by spatial-aware models, acquiring a large-scale pre-training dataset that includes spatial information remains challenging. In contrast, there is a growing abundance of pre-trained language models based on textual data. This motivates us to explore the possibility of leveraging these pre-trained language models by training an adapter on a small dataset containing document-specific information. By adopting this approach, we can effectively utilize existing models while minimizing the time and cost required for training.

Towards Effective Spatial-Aware Adapter
During our investigation into the effects of model modality, pre-training, and fine-tuning on various types of OOD inputs, we found that spatial/layout information plays a critical role in the document domain. However, existing pre-trained models such as the LayoutLM series, SelfDoc, and UDoc do not fully leverage the benefits of well-pre-trained language models. This raises the question of whether a large-scale language model, such as RoBERTa, can be adapted to detect OOD documents effectively. In this section, we demonstrate that incorporating an adapter module that accounts for spatial information into transformer-based pre-trained models can achieve strong performance with minimal changes to the code. To the best of our knowledge, this is the first study to apply the adapter idea to documents.
Spatial-aware adapter. Given a pre-trained language model such as RoBERTa, we propose an adapter that utilizes spatial information. We consider two potential designs: 1) the adapter is appended to the word embedding layer, denoted as Spatial-RoBERTa (pre), which requires both pre-training and fine-tuning; this architecture is illustrated in the top row of Fig. 7. 2) The adapter is appended to the final layer of the text encoder, denoted as Spatial-RoBERTa (post), which only requires fine-tuning, as the model can utilize the pre-trained textual encoder, as shown in the bottom row of Fig. 7.
For Spatial-RoBERTa (pre), we freeze the word embedding layer during pre-training for several reasons: 1) word embeddings learned from a large-scale corpus already cover most of the words appearing in documents; 2) pre-training on documents without strong language dependency may not help improve word embeddings. For example, in semi-structured documents (e.g., forms, receipts), language dependencies are not as strong as in text-rich documents (e.g., letters, resumes), which may degrade the learned word representations. In practice, each word has a normalized bounding box (x0, y0, x1, y1), where (x0, y0) / (x1, y1) corresponds to the position of the upper-left / lower-right corner of the bounding box. To encode positional information, we employ four position embedding layers, where each layer encodes one coordinate (e.g., x0) and produces a corresponding position embedding. The special tokens ([CLS], [SEP], and [PAD]) are assigned an empty bounding box (0, 0, 0, 0). As depicted in the top row of Fig. 7, the spatial-aware word embeddings are formed by adding the position embeddings to their corresponding word embeddings.
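A minimal numpy sketch of the (pre) variant's embedding path, with frozen word embeddings and four coordinate embedding tables; the vocabulary size, embedding dimension, and integer coordinate range are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D, MAX_COORD = 100, 32, 1000  # illustrative sizes

# Frozen word embedding table plus four trainable coordinate tables,
# one per bounding-box coordinate (x0, y0, x1, y1).
word_emb = rng.normal(size=(VOCAB, D))
coord_embs = [rng.normal(size=(MAX_COORD + 1, D)) for _ in range(4)]

def spatial_word_embeddings(token_ids, bboxes):
    """Sum the word embedding with one position embedding per coordinate.

    bboxes: (seq_len, 4) integer coordinates (x0, y0, x1, y1); special
    tokens ([CLS], [SEP], [PAD]) use the empty box (0, 0, 0, 0).
    """
    out = word_emb[token_ids]
    for c in range(4):
        out = out + coord_embs[c][bboxes[:, c]]
    return out

tokens = np.array([0, 7, 42])            # e.g., [CLS] followed by two words
boxes = np.array([[0, 0, 0, 0],          # empty box for [CLS]
                  [120, 50, 300, 80],
                  [310, 50, 480, 80]])
embs = spatial_word_embeddings(tokens, boxes)
```

In the actual model the word table stays frozen while the four coordinate tables (and the transformer) are optimized, so the adapter adds only four small embedding layers of parameters.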
For Spatial-RoBERTa (post), position embeddings are added through late fusion in the final hidden states during fine-tuning, without affecting the pre-trained encoder. Our experiments demonstrate that introducing spatial-aware adapters during pre-training yields better results than only adding position embeddings during fine-tuning. For additional details, please refer to Appendix C. In the following, we focus on analyzing Spatial-RoBERTa (pre) and comparing both its ID and OOD performance with that of the pure-text pre-trained RoBERTa.

Spatial-RoBERTa significantly outperforms RoBERTa. To verify the effectiveness of Spatial-RoBERTa, we compare the OOD detection performance of pre-trained and fine-tuned models. The results are shown in Fig. 8, where OOD performance is based on KNN+ (K=10). Full results can be seen in Table 6. Spatial-RoBERTa significantly improves the OOD detection performance, especially after fine-tuning. For example, compared to RoBERTa (base), Spatial-RoBERTa (base) improves AUROC significantly by 4.24% averaged over the four in-domain OOD datasets. This further confirms the importance of spatial information for OOD detection in the document domain.
Spatial-RoBERTa is competitive for both ID classification and OOD detection. Beyond OOD detection performance, we also examine the multi-class ID classification accuracy and plot the two metrics for all models with different modalities in Fig. 9. We can clearly observe a positive correlation between ID accuracy and OOD detection performance (measured by AUROC) for both in-domain and out-domain OOD data. Moreover, spatial-aware models display superior ID accuracy and OOD robustness compared to text-only and vision-only models. Overall, Spatial-RoBERTa greatly improves upon RoBERTa and matches the performance of models with more complex and specialized architectures such as LayoutLM. Specifically, Spatial-RoBERTa Large achieves 97.37 ID accuracy, which is even higher than LayoutLM (97.28) and UDoc (97.36).
To summarize, our spatial-aware adapter effectively adapts pre-trained transformer-based text models to the document domain, improving both ID and OOD performance. In addition, by freezing the original word embeddings during pre-training, the models (Spatial-RoBERTa Base and Spatial-RoBERTa Large) are parameter-efficient and thus reduce the training cost.

Conclusions
In this work, we provide a comprehensive and in-depth study on the impacts of pre-training, fine-tuning, model modality, and OOD scores on a broad variety of document OOD detection tasks. We present novel insights on document OOD detection, which are under-explored or in contrast with OOD detection works based on vision-language models. In particular, we highlight that spatial information is critical for OOD detection in documents. We further propose a spatial-aware adapter as an add-on module for transformer-based models. Our module adapts pre-trained language models to the document domain. Extensive experiments on a broad range of datasets verify the effectiveness of our design. We hope our work will inspire future research toward improving OOD robustness for reliable document understanding.

Limitations
In this work, our main focus is on OOD detection for document understanding, with a specific emphasis on the context of document classification. As OOD detection based on document pre-trained models remains largely underexplored, we believe establishing an in-depth and extensive study of OOD detection for document classification is a valuable stepping stone towards more complex tasks. Apart from document classification, in Appendix B we also investigate OOD detection for two entity-level tasks: document entity recognition and document object detection. We leave a more comprehensive treatment for future work.

A Dataset and Model Details
A.1 Datasets

The full RVL-CDIP dataset consists of 320K/40K/40K training/validation/testing images under 16 categories. We select 12 of them as the ID (in-domain) data. We employ the Google OCR engine to extract the text and layout information, which provides tokens, text blocks, and the corresponding bounding boxes.

A.2 Quantifying OOD Dataset Construction
The distance between datasets can be measured via the Optimal Transport Dataset Distance (OTDD). We visualize the OTDD between ID and OOD (both in-domain and out-domain) data in Fig. 10a, where we highlight the in-domain OOD data in blue and the out-domain OOD data in green. Specifically, we randomly sample 1,000 images from each dataset and calculate the average distance between pairs of datasets. We can see a significant gap between the OTDD of in-domain OOD data and that of out-domain OOD data. To make the analysis more thorough, we consider two additional in-domain OOD settings: (1) selecting the classes the model performs well on as OOD data; (2) randomly selecting classes as OOD data. The results are shown in Fig. 10b and Fig. 10c. We can see that the distance between ID and in-domain OOD is similar to the original scheme (Fig. 10a). This suggests that most in-domain OOD categories are not far from the ID data.
While this paper represents an initial endeavor, we hope that our work will serve as a stepping stone towards constructing more comprehensive and diverse OOD benchmarks in the document domain, akin to those available in the NLP and natural image domain.

A.3 Models and Training Details
All models reported in Fig. 2b, except UDoc, are initialized with pre-trained weights from Huggingface and fine-tuned on the full RVL-CDIP training set. During fine-tuning, we train these models on RVL-CDIP with the cross-entropy loss. The models were optimized with the Adam optimizer (Kingma and Ba, 2014) for 30 epochs with a batch size of 50 and a learning rate of 2 × 10−5 on 8 A100 GPUs.
The following are the hyperparameters of the models used in our paper:

Text-only:
• BERT and RoBERTa: We adopt RoBERTa Base (12 layers) and BERT Base (12 layers) as backbones and set the maximum sequence length to 512. For RoBERTa, the classifier consists of two linear layers followed by a tanh activation function.
• Longformer Base: We also employ Longformer Base (12 layers) as the backbone and set the maximum sequence length to 4,096.

Vision-only:
• ResNet50: We adopt ResNet50 pre-trained on ImageNet-1k as the backbone and fine-tune the model at a resolution of 224×224.
Text+Layout:
• Spatial-RoBERTa Base (Pre): This model adds our spatial-aware adapter to the pre-trained RoBERTa Base model. The adapter is applied to the word embedding layer. We freeze the pre-trained word embeddings and optimize the spatial-aware adapter and the transformer layers.
• Spatial-RoBERTa Base (Post): Instead of inserting the spatial-aware adapter at the input layer, this model integrates the spatial-aware adapter at the output layer of the transformer.
Overall, our experiments show that using the feature space makes the scores more distinguishable between in-distribution and out-of-distribution inputs and, as a result, enables more effective OOD detection.
Multi-modal:
• UDoc: We use a slight variant of UDoc, with the only difference in the sentence encoder: we adopt a smaller version of the pre-trained sentence encoder (all-MiniLM-L6-v2, 6 layers) instead of the larger sentence encoder (bert-base-nli-mean-tokens, 12 layers).
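Since we report KNN-based scores over feature embeddings (Table 3), a minimal sketch of such a feature-space OOD score is given below. The score is the negative distance from a test embedding z to its k-th nearest neighbor in a bank of ID training features; an input is flagged as OOD when the score falls below a threshold γ. The toy 2-D features and the choice of k and γ are illustrative assumptions, not the paper's settings.

```python
import math

def knn_ood_score(z, id_features, k=5):
    """Negative distance to the k-th nearest ID feature: larger = more ID-like."""
    dists = sorted(math.dist(z, f) for f in id_features)
    return -dists[k - 1]

def is_ood(z, id_features, gamma, k=5):
    """Flag the input as OOD when its score falls below the threshold gamma."""
    return knn_ood_score(z, id_features, k) < gamma

# Toy 2-D feature bank: ID embeddings clustered near the origin.
id_bank = [(0.1 * i, 0.1 * j) for i in range(-5, 6) for j in range(-5, 6)]
print(is_ood((0.0, 0.1), id_bank, gamma=-1.0))  # False: close to the ID bank
print(is_ood((5.0, 5.0), id_bank, gamma=-1.0))  # True: far from the ID bank
```

In practice γ is chosen on ID validation data, e.g., so that 95% of ID inputs are retained.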

B Beyond Document Classification
Document Object Detection. For document object detection, we use PubLayNet as the ID dataset and construct the OOD dataset from IIIT-AR-13K. Unlike PubLayNet, where the documents are scientific articles, IIIT-AR-13K is a dataset for graphical object detection in business documents (e.g., annual reports), so there is an obvious domain gap. We select natural images as the OOD entity and keep only the images that contain the OOD entity. Two object detection models are considered in this paper: (1) a vanilla Faster R-CNN with a ResNet-50 visual backbone, and (2) Faster R-CNN with VOS (Du et al., 2022), a recent unknown-aware learning framework that improves OOD detection performance for natural images. Following the original paper, we use 1,000 samples for each ID class to estimate the class-conditional Gaussian statistics. The models are trained for 180k iterations with a base learning rate of 0.01 and a batch size of 8 using the Detectron2 framework (Wu et al., 2019). The performance of the models is measured using the mean average precision (mAP) @ intersection over union (IoU) [0.50:0.95] of bounding boxes.
Document Entity Recognition. For entity recognition, we construct ID and OOD datasets from FUNSD. Each semantic entity includes a list of words, a label, and a bounding box. The standard label set for this dataset contains four categories: question, answer, header, and other. In this paper, we select entities labeled as other or header as OOD data, and the entities belonging to the remaining three categories as ID. Instead of treating entity recognition as a named-entity recognition problem, we follow UDoc and solve this problem at the semantic region level. We replace the sentence encoder in UDoc with a smaller sentence encoder (all-MiniLM-L6-v2; https://huggingface.co/sentence-transformers) from Huggingface (Wolf et al., 2019). We also evaluate the following model variants to verify the effectiveness of combining modalities: textual-only, visual-only, textual+spatial, visual+spatial, and visual+textual+spatial.
We provide details on datasets and models as follows.

B.1 Datasets
The original FUNSD (Jaume et al., 2019) dataset contains 149 training and 50 testing images. For document entity recognition, we treat entities with the category other/header as OOD entities. After the split, if we consider other as OOD, we have a total of 8,330 ID and 1,019 OOD entities. Otherwise, if we consider header as OOD, we have 8,981 ID and 368 OOD entities in total.
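The label-based split described above can be sketched as follows. The toy entity records are placeholders (real FUNSD entities also carry words and a bounding box); only the filtering logic is illustrated.

```python
# Toy entity records; real FUNSD entities also include words and a box.
entities = [
    {"label": "question"}, {"label": "answer"}, {"label": "answer"},
    {"label": "header"}, {"label": "other"}, {"label": "other"},
]

def split_entities(entities, ood_label):
    """Treat one label as OOD; entities with the remaining labels form the ID set."""
    ood = [e for e in entities if e["label"] == ood_label]
    id_ = [e for e in entities if e["label"] != ood_label]
    return id_, ood

id_e, ood_e = split_entities(entities, "other")
print(len(id_e), len(ood_e))  # 4 2
```

Applying the same filter with `ood_label="header"` yields the alternative split reported above.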
For document object detection, we consider PubLayNet (Zhong et al., 2019), which contains 336K/11K training/validation images with 5 categories (text, title, list, figure, and table). The original IIIT-AR-13K (Mondal et al., 2020) contains 5 categories (table, figure, natural image, logo, and signature). In this paper, considering the overlap between IIIT-AR-13K and PubLayNet, we select the images containing natural images as the OOD test set. After filtering, we obtain 2,880 OOD entities across 1,837 document images.
We consider the following ID datasets in this experiment. (1) PubLayNet: the original PubLayNet dataset; we treat all the entities in the training/validation images as ID entities. (2) PubLayNet + IIIT-AR-13K: considering the domain shift between the ID data (PubLayNet) and the OOD data (IIIT-AR-13K), we combine the PubLayNet training data with the images from IIIT-AR-13K that have overlapping annotations (table and figure) and train the object detection model on the combined set.

B.2 Models
Fig. 13 illustrates the entity recognition models used in this paper. We model entities at the region level instead of the token level, as regions provide richer semantic information. As the pre-trained model, we adopt UDoc (trained on IIT-CDIP), since it models inputs at the regional level. Based on the UDoc framework, we develop the following models.
Vision/Vision+Layout:
• Sentence BERT+Position: This model is close to the model above but adds position embeddings to the sentence embeddings.
Vision+Text+Layout:
• ResNet-50+Sentence BERT: This model follows the same framework as UDoc, but replaces the sentence encoder in the original design with a smaller sentence encoder (all-MiniLM-L6-v2).
All the models are fine-tuned with the cross-entropy loss for 100 epochs, using a learning rate of 10^-5 and a batch size of 8 on an A100 GPU.

B.3 Summary of Observations
We provide a summary of observations here and hope to inspire future work on a thorough investigation of OOD detection for entity-level tasks. To identify entity types, models should not only understand the words but also utilize spatial and visual information.
For document entity recognition, the comparison of distance-based and logit-based OOD detection methods with different models is shown in Fig. 14a, with more details in Table 2. We see that, with the help of spatial information, models can better predict the entity type and also achieve better OOD robustness. Considering the weak language dependency between entities, it is not surprising that vision-based models achieve better performance than text-based models. In particular, UDoc with ResNet-50 achieves the best performance on the two OOD test sets, illustrating that visual information plays a major role in increasing the discriminability of entities with similar semantics. For document object detection, we summarize our findings in Fig. 14b and describe them in more detail in Table 1. The OOD detection performance is further improved by introducing document images from IIIT-AR-13K with the same ID annotations as training data.
To provide more intuition, we visualize the document entity recognition OOD detection results in Fig. 15. In Fig. 16, we visualize the predictions on sample OOD images, using object detection models trained without VOS (top) and with VOS (bottom), respectively. We can see that the vanilla Faster R-CNN trained on PubLayNet produces false positives when applied to the OOD document images from IIIT-AR-13K. Table 1 shows that introducing the unknown-aware learning method, optimized for both ID and OOD data, can reduce the FPR95 while preserving the mAP on the ID data. This experiment indicates that incorporating uncertainty estimation into the entity detection training procedure can improve the reliability of the document object detection system.
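For reference, the FPR95 metric used throughout these tables can be computed as below: choose the score threshold that retains 95% of ID inputs, then report the fraction of OOD inputs still (wrongly) accepted. The toy score lists are illustrative.

```python
def fpr_at_95_tpr(id_scores, ood_scores, tpr=0.95):
    """FPR at 95% TPR: pick the threshold that keeps `tpr` of ID inputs
    (higher score = more ID-like), then measure the fraction of OOD
    inputs scoring at or above that threshold."""
    s = sorted(id_scores)
    tau = s[int((1 - tpr) * len(s))]  # 5th-percentile ID score
    return sum(o >= tau for o in ood_scores) / len(ood_scores)

id_scores = [i / 100 for i in range(100)]  # toy ID scores: 0.00 .. 0.99
ood_scores = [-0.5] * 90 + [0.5] * 10      # toy OOD scores; 10% overlap ID range
print(fpr_at_95_tpr(id_scores, ood_scores))  # 0.1
```

A lower FPR95 means fewer OOD inputs slip past the detector at the fixed 95% ID retention rate, which is why Table 1 reports it alongside mAP.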

• Table 2 corresponds to the results shown in Fig. 15 and Fig. 14a.
• Table 1 corresponds to the results shown in Fig. 16 and Fig. 14b.
• Table 3 and Table 7 correspond to the results shown in Fig. 4a.
• Table 4 and Table 5 correspond to the results shown in Fig. 4c.
• Table 6 corresponds to the results shown in Fig. 8 and Fig. 9.
• Table 9 and Table 8 correspond to the results shown in Fig. 6 and Fig. 9.
• Table 10 and Table 11 correspond to the analysis in Sec. 4 and Sec. 4.2.
• Table 12 corresponds to the results shown in Fig. 9.

Figure 1 :
Figure 1: Illustration of OOD detection for document classification. The pre-training and fine-tuning pipelines are shown on the top left and bottom left, respectively. Right: during inference, an OOD score can be derived based on the logits g(x) or feature embeddings z := h(x). A document input x is identified as OOD if its OOD score is below some threshold γ.

Figure 2 :
Figure 2: (Left) Illustration of models for document pre-training and classification, with our proposed spatial-aware models in green blocks. Modality information is also shown atop each architecture. (Right) Fine-tuning performance of pre-trained models for document classification. Models are grouped into several categories (from left to right): language-only, vision-only, and multi-modal. For comparison, the performance of corresponding models in other groups is shown in gray. The average accuracy for each model is indicated in parentheses.

Figure 3 :
Figure 3: (Top) Examples of ID inputs sampled from RVL-CDIP. (Bottom) In-domain OOD inputs from RVL-CDIP, and out-domain OOD inputs from Scientific Poster and Receipts.

Figure 4 :
Figure 4: The impact of pre-training data on zero-shot OOD detection performance. IIT-CDIP − denotes the filtered pre-training data after removing the "OOD" categories.

Figure 7 :
Figure 7: Illustration of our spatial-aware adapter for language models. We present two adapter designs (marked in green boxes): (1) insert the adapter into the word embedding layer during pre-training and fine-tuning; (2) insert the adapter into the output layer for fine-tuning only. For the first design, we freeze the word embedding layer and learn the adapter and transformer layers.

Figure 8 :
Figure 8: Comparison of OOD detection performance of Spatial-RoBERTa and RoBERTa. All models are initialized with public pre-trained checkpoints trained on purely textual data and further pre-trained on IIT-CDIP. The only difference is that Spatial-RoBERTa has an additional spatial-aware adapter and takes word bounding boxes as additional inputs.

Figure 9 :
Figure 9: Correlation between ID accuracy and OOD detection performance. For most models, ID accuracy is positively correlated with OOD detection performance. Language models with spatial-aware adapters (highlighted in blue) achieve significantly higher ID accuracy and stronger OOD robustness (in AUROC) compared to language models without adapters. Here, (+) denotes further pre-training on the IIT-CDIP dataset.

Figure 10 :
Figure 10: Visualization of the optimal transport dataset distance between ID and OOD (in-domain and out-domain) datasets. We highlight the in-domain OOD data in blue and the out-domain OOD data in green.

Figure 11 :
Figure 11: Feature visualization for pre-trained (with different amounts of pre-training data) and fine-tuned models. We show both in-domain (RVL-CDIP) and out-domain (CORD) OOD datasets.

Figure 13 :
Figure 13: The network architectures in green blocks are our proposed models. We also show the modality information on top of each architecture.
In the main paper, we mainly focus on document classification to provide a thorough and in-depth analysis. In this section, we go beyond document classification and explore OOD detection for two entity-level tasks in documents: document entity recognition and document object detection. It is natural to detect and recognize basic units in documents such as text, tables, and figures. Document entity recognition aims to predict the label of each semantic entity given its bounding box. Document object detection is an object detection task for document images. Specifically, we denote the input as x and the bounding box coordinates associated with an object instance in the image as b ∈ R^4, and use the model with parameters θ to model the bounding box regression p_θ(b|x) and the label classification p_θ(y|x, b). Given a test input x, the OOD detection scoring function for entity detection and recognition can be unified as S(x, b̂), where b̂ denotes an object instance predicted by the object detector. In particular, for document entity recognition, since the bounding boxes are provided, the OOD score simplifies to S(x, b), where b is the given object instance.
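One simple instance of such a per-box score S(x, b) is the maximum softmax probability of the classifier head for that box. The sketch below is an illustrative assumption, not the paper's exact scoring function: the per-box logits are toy values, and a peaked distribution is taken as ID-like while a flat one is OOD-like.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entity_ood_score(box_logits):
    """Maximum softmax probability for one predicted/given box: a simple
    instance of the per-box scoring function S(x, b)."""
    return max(softmax(box_logits))

# Toy per-box classification logits from a detector.
confident_box = [8.0, 0.5, 0.2, 0.1]  # peaked distribution -> ID-like
uncertain_box = [1.0, 0.9, 1.1, 1.0]  # flat distribution -> OOD-like
print(entity_ood_score(confident_box) > entity_ood_score(uncertain_box))  # True
```

For entity recognition the score is evaluated at the given box b; for object detection it is evaluated at each predicted box b̂, with a threshold γ separating ID from OOD instances as in Fig. 1.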

Figure 14 :
Figure 14: Ablation on document entity recognition and object detection. Numbers are reported in FPR95.

Figure 15 :
Figure 15: Visualization of detected OOD entities on form images. In the top part, the entities in blue are annotated as other. The bottom part shows the detected OOD entities (green). We also show failure cases on the right.

(a) Illustration of common structures for document pre-training and classification. (b) A detailed comparison of per-category accuracy on the RVL-CDIP test set.

Table 1 :
Comparison with different training and detection methods.

Table 2 :
Comparison of different models on the FUNSD OOD setting. All models are initialized with UDoc pre-trained on IIT-CDIP and fine-tuned on FUNSD data with ID entities. All values are percentages. S-BERT denotes Sentence BERT. A lower FPR95 or a higher AUROC value indicates better performance.

Table 3 :
OOD detection performance for document classification with different amounts of pre-training data from IIT-CDIP. ID (Acc) denotes the ID accuracy obtained by testing on the ID test data. We report the KNN-based scores for both pre-trained and fine-tuned models. Sci. Poster denotes the document images converted from the NJU-Fudan Paper-Poster Dataset. Receipt denotes the receipt images collected from the CORD receipt understanding dataset. For in-domain OOD test data, we also report the averaged scores.

Table 4 :
OOD detection performance for document classification with different amounts of pre-training data from IIT-CDIP − (removing pseudo-OOD categories).

Table 5 :
OOD detection performance for document classification with different amounts of pre-training data from IIT-CDIP − (removing pseudo-OOD categories).

Table 6 :
OOD detection performance for document classification. Spatial-RoBERTa Base (Pre) or SR Base (Pre) denotes applying the spatial-aware adapter in the word embedding layer. Spatial-RoBERTa Base (Post) or SR Base (Post) denotes applying the spatial-aware adapter at the output layer.

Table 7 :
OOD detection performance for document classification with different amounts of pre-training data from IIT-CDIP.

Table 8 :
OOD detection performance for document classification. Longformer 4096 denotes the original model adopted from the Huggingface model hub. Longformer 4096 (+) denotes additional pre-training on IIT-CDIP.

Table 9 :
OOD detection performance for document classification. All models are pre-trained on ImageNet.

Table 10 :
OOD detection performance for document classification (selecting as OOD the categories on which most models, across different modalities, achieve the best performance).

Table 12 :
OOD detection performance for document classification. All models are pre-trained on IIT-CDIP. For LayoutLM models, we adopt the checkpoints from the Huggingface model hub. For UDoc, we pre-train the model ourselves. All models are fine-tuned on the RVL-CDIP ID data.