UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than for previous work that follows domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation. We design a shape-adaptive cropping module before the encoder-decoder architecture of the MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance on 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Code and instruction-tuning datasets will be released.


Introduction
Leveraging strong Large Language Models as the language decoder, some recent works propose Multimodal Large Language Models (MLLMs) (Zhu et al., 2023; Liu et al., 2023a; Ye et al., 2023; Li et al., 2023) and achieve promising vision-and-language understanding performance. Surprisingly, without in-domain training, these MLLMs exhibit shallow zero-shot visual text recognition ability when fed a low-resolution image with salient text information (Ye et al., 2023; Liu et al., 2023b). However, due to the variety of image types and the wide range of image sizes, they are still far from universal visually-situated language understanding, such as extracting information from documents, reading texts from webpages, and visual question answering on tables, as shown in Figure 1.
Existing works for visually-situated language understanding can be categorized into two-stage (Xu et al., 2021; Huang et al., 2022; Yang et al., 2021) and end-to-end (Davis et al., 2022; Kim et al., 2022; Lee et al., 2022) methods according to whether they rely on an off-the-shelf OCR model or API. These works all follow a domain-specific pretraining and finetuning paradigm, leading to high training costs; e.g., the end-to-end model Donut (Kim et al., 2022) costs more than 192 A100 days.
Inspired by the shallow text recognition ability of existing MLLMs, in this work, we propose UReader for universal OCR-free visually-situated language understanding, which leverages the Multimodal Large Language Model via low-cost instruction tuning (Dai et al., 2023). Different from previous works, we forgo pretraining tasks by leveraging an existing MLLM and directly finetune it by taking full advantage of various Visually-situated Language Understanding datasets. To make the most of the strong language understanding ability of the MLLM, we convert all tasks into a vision-language instruction tuning format. Besides, to enhance text recognition and semantic understanding ability across diverse domains, we design auxiliary text reading and key points generation tasks in the same instruction format. To utilize the low-resolution encoder of the MLLM for processing high-resolution images and avoid blur and distortion due to resizing, we propose a shape-adaptive cropping module to cut a high-resolution image into multiple local images. Each image is first independently encoded with the frozen visual encoder and a trainable visual abstractor, and the features are then concatenated and fed into the language decoder. Moreover, we add learnable crop position encoding to help the model correlate local images and add a resized global image to alleviate salient information loss due to cropping.
Our contributions in this work are four-fold:
• We first propose instruction tuning with Multimodal Large Language Models for OCR-free Visually-situated Language Understanding.
• We build an instruction-tuning dataset covering 5 domains of visually-situated language understanding: document, table, chart, natural image, and webpage screenshot.
• We design a shape-adaptive cropping module to utilize the frozen low-resolution vision encoder for processing high-resolution images.
• UReader achieves state-of-the-art OCR-free performance in 8 out of 10 tasks, across 5 domains.
Related Work

Visually-situated Language Understanding. According to whether off-the-shelf OCR models or APIs are used to recognize texts from images, existing work can be divided into two-stage models (Xu et al., 2021; Huang et al., 2022; Tang et al., 2023; Yang et al., 2021) and end-to-end models (Kim et al., 2022; Davis et al., 2022; Lee et al., 2022). Two-stage work usually designs pretraining tasks to learn cross-modality alignment between visual inputs and text inputs. For example, for document understanding, UDOP (Tang et al., 2023) designs a Joint Text-Layout Reconstruction task to recover masked texts and layout information given the visual inputs and retained text inputs. LayoutLMv3 (Huang et al., 2022) applies a Masked Image Modeling task to recover masked image tokens with the context of their surrounding text and image tokens. Without the help of an off-the-shelf OCR model, end-to-end models need to learn text recognition with a high-resolution image encoder during the pretraining stage. For example, Pix2Struct (Lee et al., 2022) proposes a Screenshot Parsing pretraining task, where the model needs to generate the complete HTML DOM tree with only a masked webpage screenshot as the input. Donut (Kim et al., 2022) designs a pretraining task to generate all texts in the document image. These works all follow a domain-specific pretraining and finetuning paradigm and therefore incur high training costs, e.g. Donut is trained for more than 192 A100 days. In this work, by leveraging the shallow text recognition ability of Multimodal Large Language Models, we propose to directly perform instruction tuning across various types of images and greatly reduce the training cost for universal visually-situated language understanding.

Multimodal Large Language Models are developed to empower Large Language Models with multimodal understanding ability, especially for vision information. These works (Huang et al., 2023; Zhu et al., 2023; Liu et al., 2023a; Ye et al., 2023; Li et al., 2023; Dai et al., 2023) mainly connect a pre-trained vision encoder (usually CLIP ViT-L/14 (Radford et al., 2021)) with a strong large language model, such as LLaMA (Touvron et al., 2023). These MLLMs show some emergent abilities, including shallow zero-shot text recognition ability (Liu et al., 2023b). However, they are still far from universal visually-situated language understanding. Firstly, because the pretraining data for the vision encoder are mostly natural images, MLLMs show barely acceptable text understanding performance on natural images but poor performance on other types, such as documents (Liu et al., 2023b). Secondly, most images for visually-situated language understanding are high-resolution; rescaling them to low resolution to fit the vision encoder renders the text blurry and distorted. In this work, we propose to fully leverage the shallow text recognition ability of MLLMs and perform instruction tuning to enhance their universal understanding ability across 5 domains. Besides, we design a shape-adaptive cropping module to alleviate the text blur and distortion problem.

UReader
The primary goal of UReader is to efficiently utilize existing MLLMs for Visually-situated Language Understanding tasks. In this work, we utilize, but are not limited to, mPLUG-Owl (Ye et al., 2023) as our basic MLLM. Figure 2 presents the overall architecture of UReader. The input image is first pre-processed by a shape-adaptive cropping module (Section 3.1). The resulting sub-images are then simultaneously passed through the visual encoder and visual abstractor. To enable the large language model to correlate multiple cropped sub-images, we apply a crop position encoding module to introduce spatial information across sub-images (Section 3.2).

Shape-Adaptive Cropping Module
Images with texts have various aspect ratios and a great range of resolutions. Simply resizing the image to (H_v, W_v) (the raw resolution of the MLLM) can result in text being blurred, distorted, and unrecognizable. Thus we propose a shape-adaptive cropping module. Specifically, as shown in Figure 3, we pre-define grids {g = (n_h × n_w) | n_h · n_w ≤ N_c, n_h ∈ N, n_w ∈ N} with various shapes, where n_h and n_w denote the number of rows and columns of the grid g and N_c denotes the maximum number of cells (sub-images). To select a suitable grid for an image I with shape H × W, two rules should be followed: (1) the grid should preserve the resolution of the image as much as possible, and (2) the grid should fit the aspect ratio of the input image.
To measure the resolution coherence and shape similarity between the image and each grid, we calculate the resolution-related and resolution-agnostic intersection over union, S_rr and S_ra, as follows:

S_rr(I, g) = IoU((H, W), (n_h · H_v, n_w · W_v))
S_ra(I, g) = IoU((n_w · H / W, n_w), (n_h, n_w))

where IoU denotes the intersection over union between two rectangles centered and aligned with each other. The matched grid is selected by maximizing the matching score:

g* = argmax_g S_ra(I, g) + S_rr(I, g)

where g* is the selected grid.
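The grid-selection step above can be sketched in a few lines. This is a minimal illustration of the two IoU-based scores under the assumption that IoU is computed between centered, axis-aligned rectangles; function names and the plain-tuple representation are our own, not from the paper.

```python
# Hedged sketch of shape-adaptive grid selection: pick the (n_h, n_w) grid
# maximizing the resolution-related plus resolution-agnostic IoU scores.

def iou(box_a, box_b):
    """IoU of two (height, width) rectangles centered and aligned with each other."""
    inter = min(box_a[0], box_b[0]) * min(box_a[1], box_b[1])
    union = box_a[0] * box_a[1] + box_b[0] * box_b[1] - inter
    return inter / union

def candidate_grids(max_cells=20):
    """All (n_h, n_w) grids with n_h * n_w <= N_c."""
    return [(nh, nw)
            for nh in range(1, max_cells + 1)
            for nw in range(1, max_cells + 1)
            if nh * nw <= max_cells]

def select_grid(h, w, hv=224, wv=224, max_cells=20):
    """Select the grid maximizing S_rr + S_ra for an H x W image."""
    best, best_score = None, -1.0
    for nh, nw in candidate_grids(max_cells):
        s_rr = iou((h, w), (nh * hv, nw * wv))       # resolution-related score
        s_ra = iou((nw * h / w, nw), (nh, nw))       # resolution-agnostic (aspect-ratio) score
        if s_rr + s_ra > best_score:
            best_score, best = s_rr + s_ra, (nh, nw)
    return best
```

For example, a tall 2240 × 224 screenshot-like image selects a 10 × 1 grid, while a 448 × 448 square image selects 2 × 2, matching the intuition behind the two rules.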
The visual encoder extracts visual features V ∈ R^(N × N_v × d_v), where N_v and d_v denote the number and dimension of the extracted visual features, respectively, and N is the number of images. The visual abstractor further summarizes the visual information and obtains higher-level semantic visual representations V^l ∈ R^(N × N_q × d_l) in the language feature space via several learnable queries, where d_l denotes the dimension of the language feature space and N_q denotes the number of learnable queries.

Cropped Images Modeling with LLM
MLLMs are mostly trained with a single image as input. Due to the cropping module, we need to feed visual features from multiple images into the language model. The 1-dimensional position embeddings of the LLM cannot reflect the spatial position of each sub-image, which is critical for correlating local images. Therefore, we incorporate a 2-dimensional crop position encoding to help the language model understand the spatial relationship between cropped images. Specifically, we assign a location index (i, j) to each cell of the selected grid and obtain its row embedding and column embedding from two auxiliary embedding layers as follows:

e^row_{i,j} = Embedding_row(i)
e^column_{i,j} = Embedding_column(j)
e_{i,j} = e^row_{i,j} + e^column_{i,j}

where e_{i,j} ∈ R^(d_l) denotes the crop position embedding of the cell at row i and column j. We add the embedding to the visual features of each cell in the language space via broadcasting along the dimension of learnable queries: V^l_{i,j} = V^l_{i,j} + e_{i,j}. We then reshape the visual features into V̄^l ∈ R^((N·N_q) × d_l). The resulting spatial-aware visual features and the word embeddings of the input sentence are concatenated along the sequence dimension and fed to the large language model.
In order to enhance the language model's ability to effectively model multiple images while keeping training costs low, we freeze the original language model and adopt the low-rank adaptation approach (LoRA) (Hu et al., 2022).
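For readers unfamiliar with LoRA, a minimal sketch of a low-rank-adapted linear layer is shown below. This is a NumPy toy following the general recipe of Hu et al. (2022) (frozen weight plus a scaled B·A update, B initialized to zero); the class name, shapes, and initialization scale are illustrative, not UReader's actual implementation, which applies LoRA inside the LLM.

```python
# Toy LoRA linear layer: y = x W^T + (alpha/r) * x A^T B^T, with W frozen
# and only the low-rank factors A, B trainable. Rank r=8 matches the paper.
import numpy as np

class LoRALinear:
    def __init__(self, d_in, d_out, r=8, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, r))                # trainable up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Frozen path plus scaled low-rank trainable path.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, which is one reason LoRA training is stable.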

Instruction Tuning
To develop a universal visually-situated language understanding model that can process various types of images and perform different comprehension tasks, we conduct low-cost instruction tuning with a Multimodal Large Language Model. Without introducing any large-scale pretraining datasets, we directly ensemble multiple downstream datasets and perform joint training. Different downstream tasks are all reorganized into a unified instruction format (Dai et al., 2023). Besides, we design auxiliary text reading and key points generation tasks to enhance text recognition and semantic understanding abilities.
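The conversion to the unified format can be sketched as a small dispatcher. The template strings below paraphrase the ones quoted in this paper; the function name and keyword interface are our own.

```python
# Sketch of converting raw task annotations into the unified
# "Human: ... AI: ..." instruction format used for joint training.

def to_instruction(task, **kw):
    if task == "vqa":       # question used directly as the instruction
        return f"Human: {kw['question']} AI: {kw['answer']}"
    if task == "ie":        # information extraction: category/value pairs
        return f"Human: What is the value for the {kw['category']}? AI: {kw['value']}"
    if task == "nli":       # raw label: 1 -> 'Entailed' -> Yes, 0 -> 'Refuted' -> No
        answer = "Yes" if kw["label"] == 1 else "No"
        return f"Human: {kw['statement']}, Yes or No? AI: {answer}"
    if task == "caption":   # one of several brief-description prompts
        return f"Human: Provide a brief description of the given image. AI: {kw['caption']}"
    raise ValueError(f"unknown task: {task}")
```

For instance, a missing information-extraction field simply gets the value 'None', so the same template covers present and absent categories.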

Tuning Tasks
Unified downstream task. Downstream tasks of Visually-situated Language Understanding cover Visual Question Answering, Information Extraction, Natural Language Inference, and Image Captioning. To develop a universal model, we reorganize all tasks into the instruction tuning format (Dai et al., 2023). Concretely, for the Visual Question Answering task, the question is directly used as the instruction: "Human: {question} AI: {answer}". For the Information Extraction task, each category-value pair is expressed with a prompt as "Human: What is the value for the {category}? AI: {value}". If a category does not exist in the image, the value is 'None'. In the raw annotation of the Natural Language Inference task, '1' means 'Entailed' and '0' means 'Refuted'. We reorganize the NLI task by constructing the instruction "Human: {statement}, Yes or No? AI: {answer}", where 'Yes' means 'Entailed'. For the Image Captioning task, we refer to 11 prompts from LLaVA (Liu et al., 2023a) to instruct the model to briefly describe the image, and randomly choose 1 prompt for each caption, such as "Human: Provide a brief description of the given image. AI: {caption}".

Text Reading task. Text recognition is a basic ability for OCR-free Visually-situated Language Understanding. Therefore, we apply an auxiliary Text Reading task to strengthen text recognition ability across different domains. With the text and position information in the image, we organize the texts in the common reading order: from top to bottom, from left to right. Directly using all texts as targets (Kim et al., 2022) would cause the model to focus on generating the starting texts and neglect the others to reduce the loss. Instead, we randomly choose a split position p from {0, L/6, 2L/6, ..., 5L/6}, where L is the text sequence length. The left part is used as the input and the right part as the target. p = 0 means generating all texts, while the other cases ask the model to continue reading following the input texts. Such a design forces the model to read different parts of the text given the context. Since starting texts often convey key information about the image, such as the chart title, we apply a bigger sample rate (0.5) for the '0' position and 0.1 for the other positions. To distinguish reading from the beginning from continuing reading, we design two groups of prompts and randomly choose 1 prompt for each sample. For example, an instruction for reading from the beginning can be "Human: Recognize text in the image. AI: {all texts}" and an instruction for continuing reading can be "Human: The words on this picture are {left texts}. Continue reading the text. AI: {right texts}".

Key Points Generation task. Large Language Models learn strong understanding ability from the tough language modeling task. Therefore, for stronger vision-and-language semantic comprehension ability, we propose an auxiliary Key Points Generation task, which requires the model to give some key points about the image. To support this task, we collect the QA pairs of each image and convert them to declarative sentences with Vicuna (Vicuna, 2023). These declarative sentences are finally regarded as the key points about the image. We also build a set of templates to instruct this task, such as "Human: Identify some key points in this picture. AI: {key points}".
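The split-position sampling for the Text Reading task can be sketched as below. This is a minimal sketch assuming the image texts arrive as a word list in reading order; the function name and word-join convention are our own.

```python
# Sketch of Text Reading split sampling: p = 0 (read everything) is drawn
# with probability 0.5, each of the five other split points with 0.1.
import random

def sample_text_reading_pair(words, rng=random):
    """words: image texts in reading order. Returns (input_text, target_text)."""
    L = len(words)
    splits = [0] + [k * L // 6 for k in range(1, 6)]      # {0, L/6, 2L/6, ..., 5L/6}
    p = rng.choices(splits, weights=[0.5] + [0.1] * 5, k=1)[0]
    left, right = words[:p], words[p:]                    # left = input, right = target
    return " ".join(left), " ".join(right)
```

With p = 0 the input is empty and the model must generate all texts; otherwise it continues reading after the given left context, and input plus target always reconstruct the full sequence.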
All templates for Text Reading and Key Points Generation tasks can be found in Appendix D.

Implementation Details
We conduct experiments on a recently proposed MLLM named mPLUG-Owl (Ye et al., 2023) without modifying its hyperparameters. The number of learnable queries of the visual abstractor is 65. The dimensions of hidden states d_v and d_l are both 1024. For the shape-adaptive cropping module, we set the maximum number of cells N_c to 20 by default.

Evaluation
We use the official training splits as tuning data and evaluate models on the test splits. Following previous works (Borchmann et al., 2021; Lee et al., 2022), DocVQA and InfoVQA are evaluated by ANLS (Biten et al., 2019), and DeepForm and KLC are evaluated by F1 score. WTQ, TabFact and TextVQA are evaluated by accuracy. ChartQA is evaluated with relaxed accuracy (Methani et al., 2020). TextCaps and VisualMRC are measured by CIDEr (Vedantam et al., 2015). Evaluation of TextVQA and TextCaps is performed with the official challenge website.
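As a reference for the ANLS metric used for DocVQA and InfoVQA, here is a small sketch following the definition of Biten et al. (2019): each prediction is scored by 1 minus the normalized Levenshtein distance against its closest ground truth, zeroed when the distance is at least the threshold τ = 0.5. The lower-casing and whitespace handling here are simplifying assumptions; official evaluation should use the challenge servers.

```python
# Sketch of ANLS (Average Normalized Levenshtein Similarity).

def levenshtein(a, b):
    """Classic edit distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """predictions: list of strings; gold_answers: list of lists of strings."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for g in golds:
            p, gt = pred.strip().lower(), g.strip().lower()
            nl = levenshtein(p, gt) / max(len(p), len(gt), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The threshold makes the metric tolerant of small OCR-style errors while giving no credit to answers that differ too much from every ground truth.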

Main Results
We first compare UReader with state-of-the-art OCR-free models on 10 datasets. For a fair and consistent comparison across all datasets, we finetune the strong and accessible baseline Donut on unreported datasets. As shown in Table 1, without downstream finetuning, UReader achieves state-of-the-art OCR-free performance on 8 out of 10 tasks across 5 domains.

Ablation Study
We perform comprehensive ablation experiments to validate the contributions of the two auxiliary tasks, the trainable architectures, cross-domain joint training, and the design of the shape-adaptive cropping module.

Shape-adaptive Cropping. Row r6 in Table 2 represents directly tuning mPLUG-Owl without any model revisions. With shape-adaptive cropping, UReader achieves significantly better performance (r7 vs r6), showing that our cropping module is indispensable for leveraging a pretrained low-resolution vision encoder for universal visually-situated language understanding. Besides, increasing the number of crops (r8 vs r7) improves the model's performance. Since the resolution of each local image is constant (224×224), more crops mean a higher overall resolution and therefore better performance. Furthermore, adding a resized global image brings a slight improvement on most datasets (r10 vs r8), validating that a complete image can alleviate possible information loss due to cropping. Finally, dropping the crop position encoding also hurts the model's performance (r10 vs r9), proving its effectiveness for correlating local images.

Auxiliary Tasks. As shown in Table 2, both the Text Reading and Key Points Generation auxiliary tasks contribute to the final performance.
To alleviate the distortion problem due to resizing, we propose to crop images according to their raw aspect ratios. Figure 4 shows the frequency distribution of grids selected by our shape-adaptive cropping module on DocVQA, VisualMRC and WikiTableQuestions (distributions on more datasets can be found in Appendix A). For aesthetic purposes, we present the distribution with N_c = 9. Apparently, different domains of images have different shape distributions. For most document images in DocVQA, the height is greater than the width, while table images are the opposite. As webpages are scrollable, their screenshots usually take the form of a long rectangle. With the shape-adaptive cropping design, our model can easily adapt to various image shapes without domain-specific fine-tuning.
Text distortion may have little influence on visual question answering, because questions usually concern partial text information. But it is harmful for reading the texts in the image, because every text matters. For a quantitative analysis of the influence of the shape-adaptive design, we directly evaluate the performance of reading all texts. We choose BLEU (Papineni et al., 2002) as the metric because it directly measures the n-gram overlap between the ground-truth and predicted text sequences. The evaluation set is built by combining 100 randomly selected test images from each dataset. As shown in Table 3, compared with cropping all images with a fixed grid, UReader can better recognize texts.

Table 3: Text Reading performance of UReader with N_c = 9. 'w/o adapt' means removing the shape-adaptive design and cropping the image with a fixed 3 × 3 grid.
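For completeness, a simplified sentence-level BLEU in the spirit of Papineni et al. (2002) is sketched below (clipped n-gram precision, geometric mean over n = 1..4, brevity penalty). This is an illustrative toy; real evaluations should use an established implementation such as sacrebleu, and the epsilon floor for zero-match precisions is our own simplification.

```python
# Simplified BLEU sketch: clipped n-gram precisions combined by a geometric
# mean, multiplied by a brevity penalty for short predictions.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(prediction, reference, max_n=4):
    pred, ref = prediction.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        overlap = sum((pred_counts & ref_counts).values())   # clipped matches
        total = max(sum(pred_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / max(len(pred), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect transcription scores 1.0, while a prediction sharing no n-grams with the ground truth scores near zero, which is why BLEU is a direct proxy for text-reading fidelity here.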

Qualitative Results
Figure 5 shows some qualitative results produced by UReader on different types of images. UReader can not only extract information from a document (case a), but also understand different instructions and provide corresponding answers by attending to different regions (case b). However, due to the language decoding manner, when given an image with rich texts, such as a page of a book, the model often reads the beginning texts and then continues writing without looking at the image. More qualitative results can be found in Appendix C. Finally, as shown in case f, UReader is able to list some key points about the chart by combining the title and line information.
Listing key points in this work is just a superficial attempt at open-ended generation, and its performance is far from promising, e.g., UReader makes a mistake about the lowest line.More effort is needed towards a comprehensive understanding of images with rich text.

Conclusion
We first propose leveraging existing Multimodal Large Language Models for universal OCR-free visually-situated language understanding through low-cost instruction tuning. All downstream tasks are reorganized into a unified instruction-tuning format. Besides, we design the Text Reading task and the Key Points Generation task to enhance text recognition and vision-and-language semantic comprehension abilities. To utilize the pre-trained vision encoder for processing high-resolution images, we design a shape-adaptive cropping module, which cuts the image into multiple local images considering its raw aspect ratio and resolution. UReader achieves state-of-the-art OCR-free performance on 8 out of 10 datasets, ranging from documents, tables, charts, and natural images to webpage screenshots.

Limitations
Our experiments validate that UReader is able to correlate local images after cropping a high-resolution image. However, UReader struggles to understand multi-page documents (e.g. books and papers) due to the lack of ability to correlate different pages and the limited sequence length of the decoder. Besides, UReader feeds an equal number of features for each local image into the language decoder, but not all local images contain rich vision or text information. In the future, we will explore a more efficient way to encode different crops.
Furthermore, open-ended generation for visually-situated language understanding is far from well studied. We attempt to develop key points generation ability in this work, but more difficult generation tasks are not currently considered, such as giving the chain-of-thought of an answer. How to stimulate such abilities through instruction tuning is a topic worth studying. Finally, the Text Reading task helps the model recognize texts, but the text reading performance with the LLM as the decoder is far from satisfactory due to the hallucination problem. Instructing the LLM to read texts strictly according to images is a challenging topic.

Ethics Statement
Our UReader relies on multi-modal large language models that are trained on large-scale image and text data from the web and therefore may be subject to issues such as toxic language and bias (Bender et al., 2021).However, our model is further finetuned on publicly available datasets and is used specifically in the domain of visually-situated language understanding, where these issues have minimal impact.

A Grid Distribution on Downstream Datasets
We visualize the frequency distribution of grids selected by our shape-adaptive cropping module on all ten datasets in Figure 6. The wide variety of image shapes in downstream tasks highlights the crucial role of the shape-adaptive cropping module.
B Detailed Analysis on Performance

(2) The pretraining task of Pix2Struct is to predict the HTML DOM tree of a masked web screenshot, which requires the model to fully understand the layout information of the image. In contrast, UReader is trained to read texts from top to bottom, from left to right, which requires weaker layout understanding ability. The pretraining on layout understanding also leads to improved performance on DocVQA.
This conclusion can also be substantiated by observations on the other two datasets (i.e., InfoVQA and KLC) included in the document domain by previous work (Tang et al., 2023). For the InfoVQA dataset, the images are poster-style and layout is not as important as in DocVQA and DeepForm, but the relationship between text and vision objects matters more, as in natural images and chart images. As for the KLC dataset, OCR-free models are only fed the first page (usually the cover of a report), where the layout is much simpler than in DocVQA and DeepForm. Therefore, UReader can outperform the baselines on these two document datasets.
In summary, compared with the OCR-free models Donut and Pix2Struct, due to the pretraining of the MLLM on open-domain datasets, UReader is better at understanding cross-modality relationships in the image but weaker at comprehending text layout information, as it lacks large-scale document pretraining and specific layout understanding tasks.

B.2 Compared with Pipeline Methods
We list the performance of state-of-the-art pipeline models in Table 4. We can summarize two distinct aspects from the results. Firstly, our model achieves comparable or slightly worse results compared to the pipeline methods on TextVQA, ChartQA, InfoVQA, TextCaps and TabFact. Secondly, there is an obvious gap between our model and pipeline methods on DocVQA, DeepForm, KLC, WTQ and VisualMRC.
For the first aspect, there are two reasons for the similar performance: (1) Modeling the diverse relationships between visual objects and text presents challenges for both pipeline-based and OCR-free methods. TextVQA, TextCaps and InfoVQA require relation understanding between text and visual objects (i.e. logos, icons and common objects), and ChartQA asks for trend comprehension of lines. Understanding such complex cross-modality relations is challenging for both OCR-free and pipeline methods. (2) The simplicity of task formats can reduce performance gaps. TabFact is simply a binary classification task, resulting in a small performance gap.
For the second aspect, the main performance gap appears in three categories of datasets: document, table, and webpage screenshot. The reasons are twofold: (1) The gap in text recognition and layout extraction. In documents, tables and websites, text is the dominant information source, and the layout (e.g. the row and column layout of a table) is relatively more uniform than in charts and natural images. Therefore, with pre-extracted texts and layout information, it is easier to understand the image. But for OCR-free models, such as our UReader and Donut, it is still challenging to fully recognize all texts. (2) The gap in modeling capacity for multi-page document input. For the multi-page document datasets KLC (98% > 4 pages) and DeepForm (75% > 1 page), OCR-free models only take the first page as input and lose much information.

B.3 Zero-shot Performance
We test the zero-shot performance of UReader on the unseen dataset OCR-VQA. With the same evaluation metrics, UReader outperforms mPLUG-Owl (41.1 vs 28.6) and the recent work UniDoc (Feng et al., 2023) (41.1 vs 34.5), which is trained with layout prediction. The results show that the zero-shot performance of our method on unseen domains is acceptable.

C.1 Downstream Results
More qualitative results on natural images, charts, tables, documents and webpage screenshots are shown in Figures 7-11.
Figure 11 shows a sample of Text Reading and Visual Question Answering on a webpage screenshot from VisualMRC. As mentioned in Section 5.5, when given an instruction to read all texts in the image, UReader can read the beginning texts but sometimes tends to continue generating vision-irrelevant texts. With appropriate instructions, UReader can indeed recognize texts in other regions, such as 'exercise increases cellular recycling'. Therefore, the hallucination problem during text reading is not because UReader cannot recognize texts, but because of the generation manner of the LLM decoder. When the beginning texts are read from the image, the decoder may generate the following texts according to the closer text context rather than the image.

C.2 Open-domain Results
We present open-domain examples in Figure 12. We use randomly collected images and freely ask the model questions based on the content of these images. The original mPLUG-Owl is used for comparison.
In Figure 12 (a), UReader is able to accurately recognize and answer questions about the small text in natural images ("Name of passenger" and "MORRIS/KARLA").In contrast, mPLUG-Owl does not respond with the name in the first round and gives an incorrect answer even with a prompt in the second round.
In Figure 12 (b), we raise a query consisting of two cascaded questions, which requires the model to simultaneously understand the spatial position of the non-textual objects referred to by the query and locate the corresponding fields. UReader completes this task well, while mPLUG-Owl answers incorrectly in both object recognition and price extraction.
In Figure 12 (c), we conduct multi-turn conversations with UReader on a screenshot. The questions include references to the conversation history, and some also require a certain amount of common sense. For example, the time of account creation is equivalent to the time of joining Twitter, and the inactive state of the Follow button indicates that the user has not followed Elon Musk. UReader answers these questions well. mPLUG-Owl can correctly recognize Elon Musk, but it is prone to generating content that is unrelated to the image.

D Instruction Templates
The instruction templates of the auxiliary Text Reading and Key Points Generation tasks are shown in Table 5.

Figure 4: Visualization of the frequency of grids selected by the shape-adaptive cropping module. The cell at row i and column j denotes the selection frequency of grid (n_h = i, n_w = j). Deeper colors represent higher selection frequencies.

Figure 5: Qualitative results of UReader. Crucial regions are enlarged for clearer visualization.

Figure 6: Visualization of the frequency of grids selected by the shape-adaptive cropping module on 10 downstream datasets.

Human: Detect words in the photograph.
UReader: ARTICLES WITH THE TOPIC: AUTISM Could molecular handcuffs lower the protein that causes autism? By Timothy D. Rayne | May 20, 2012 Huntington's disease is a neurodegenerative disorder that causes a progressive decline in cognitive function and motor control. It is caused by a mutation in a gene that codes for a protein called huntingtin……
Human: What is suggested that molecular handcuffs could lower?
UReader: The protein that causes Huntington's disease
Human: What does exercise increase?
UReader: It increases cellular recycling.

Figure 11: Text Reading and Visual Question Answering performance of UReader on a webpage screenshot from VisualMRC. Correct and wrong answers are colored green and red, respectively.
Then, we resize the input image to (n_h · H_v, n_w · W_v) and crop it into n_h · n_w local images. To maintain the global structure information of the image, we also resize the input image to (H_v, W_v) as a global image. All images are then passed to the visual encoder and visual abstractor in parallel, and the visual encoder extracts their visual features.

Document. DeepForm* and KLC are two Information Extraction datasets. DeepForm* contains 1.1k documents related to election spending. The 2.7k documents of KLC come from published reports of charity organizations.

Table. WikiTableQuestions (WTQ*) (Pasupat and Liang, 2015) comprises 2.1k table images from Wikipedia, annotated with 23k question-answer pairs demanding comparison and arithmetic operations. TabFact* (Chen et al., 2020) is a Natural Language Inference dataset containing 112k 'entailed' or 'refuted' statements about 16k Wikipedia tables.

Chart. ChartQA (Masry et al., 2022) collects various topics and types of charts from four sources: Statista (statista.com), The Pew research (pewresearch.org), OWID (ourworldindata.org) and OECD (oecd.org). It contains 21k chart images and 32k QA pairs in total.

Table 1: Comparison with OCR-free methods on various types of visually-situated language understanding tasks. 'TSFT' means task-specific fine-tuning on the downstream dataset. Underlining means achieving 80% of SOTA performance.

The maximum number of cells N_c is set to 20 by default. During instruction tuning, the maximum sequence length is limited to 2048, and H_v, W_v are set to 224 to match the pretrained resolution of the visual encoder. For LoRA, we set the rank r = 8. The learning rate schedule uses a linear warmup over 36 steps to 1e-4, followed by cosine decay to 0. The batch size is set to 256. For better convergence on each dataset, DocVQA is up-sampled 3 times; InfoVQA, WTQ, DeepForm, and KLC are up-sampled 2 times. The instruction tuning process takes 16 A100 days for 20k training steps (10 epochs).

Table 2: Ablation study on auxiliary training tasks, trainable model architectures, cross-domain joint training and shape-adaptive cropping. 'KPG' and 'TR' refer to the Key Points Generation and Text Reading tasks, respectively. 'Abs' refers to the visual abstractor. 'Doc Data' means whether the 4 document datasets are used as training data. 'Global' means using a resized global image as input. 'Crops' refers to N_c, the maximum number of local images after cropping. 'CropPos' refers to the crop position embedding.

UReader can attend to the relevant paragraph, understand the texts and answer the question accurately. Case d shows the text reading performance: with the help of the Text Reading task, UReader is able to read texts from top left to bottom right.

Table 4: Performance comparison between UReader and state-of-the-art pipeline methods.