Transferring General Multimodal Pretrained Models to Text Recognition

This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance on the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR and demonstrate that it achieves performance competitive with a product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.


Introduction
Optical character recognition (OCR) plays an important role in real-world applications. It helps users and developers extract text content from different types of images, including photos, scanned documents, etc. In practice, building a tool for OCR requires a pipeline consisting of a text localization module and a text recognition module.
In this work, we focus on improving the accuracy of text recognition. Text recognition has often been regarded as a key challenge owing to the room for improvement in recognition accuracy. In the deep learning era, the classical methods are mostly based on CNNs and RNNs, which are responsible for visual feature extraction and sequence modeling, respectively (Shi et al., 2017a, 2019; Luo et al., 2019). Recently, with the rise of the Transformer (Vaswani et al., 2017), researchers have applied the Transformer encoder-decoder framework to text recognition and achieved results outperforming the baselines (Li et al., 2021; Lyu et al., 2022). However, most methods rely on large-scale pretraining on human-annotated or synthetic OCR data, which is hard for other researchers to collect or create for reproduction. Furthermore, the methods often include complex model or objective designs, such as a DETR-like decoder (Carion et al., 2020) or the CTC loss (Graves et al., 2006). These components might also hinder reproduction, as they increase the difficulty of training. Therefore, we naturally raise a question: is there any way to achieve high recognition accuracy without complex designs on data and model?
Inspired by the recent progress in multimodal pretraining, we argue that transferring a unified multimodal pretrained model is a possible solution. Multimodal pretraining has proved significant to the performance of downstream tasks, and thanks to the rise of unified multimodal pretrained models, a single model can perform both cross-modal understanding and generation and achieve state-of-the-art performance (Wang et al., 2022a,b; Lu et al., 2022). We therefore propose to transfer the unified multimodal pretrained model by finetuning it on text recognition datasets with the task of image captioning, which is essentially a simple sequence-to-sequence learning task optimized with maximum likelihood estimation.
To support the effectiveness of the proposed method, we conduct extensive experiments on the Chinese text recognition benchmark (Chen et al., 2021b), which covers multiple scenarios, including scene, web, document, and handwriting. Specifically, we finetune the open-source Chinese multimodal pretrained model OFA (Wang et al., 2022a) on text recognition, and we name the resulting model OFA-OCR. Figure 1 demonstrates the results of methods with or without general-domain pretraining. It shows that multimodal pretraining on general-domain vision-language data can effectively boost downstream performance in text recognition. To achieve the best performance, we apply multitask + single-task finetuning to OFA-OCR, and it outperforms the previous state-of-the-art methods on the benchmark. Through ablation studies, we further demonstrate the effectiveness of our method designs, including multitask + single-task finetuning, data augmentation, etc. Finally, to enable deployment for real-world applications, we construct a pipeline with both OFA-OCR and a simple text localization module. We find that this simple pipeline can provide high-quality OCR performance that is competitive with a product-level API.

Method

Pretraining
To leverage the capability of the multimodal pretrained model for image captioning, we employ the unified multimodal pretrained model architecture. Specifically, we implement our models on OFA (Wang et al., 2022a), an open-source state-of-the-art unified multimodal pretrained model with a release of Chinese models.
The model is mainly based on the Transformer encoder-decoder framework (Vaswani et al., 2017). To make information from different modalities adaptable to the Transformer, there are adaptors for images and texts, which are visual backbones, e.g., ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2021), etc., and word embeddings, respectively. Information from both modalities is encoded as discrete tokens so that the decoder can generate them uniformly.
For Chinese multimodal pretraining, OFA-Chinese was pretrained on a large-scale dataset, which consists of LAION-5B (Schuhmann et al., 2022), the Wukong dataset, as well as translated datasets from MSCOCO (Chen et al., 2015), Visual Genome (Krishna et al., 2017), VQA (Goyal et al., 2017), RefCOCO (Yu et al., 2016), etc. Note that this work differs from previous pretraining-related methods, which pretrain the model on large-scale human-annotated or synthetic data. We show that through pretraining on general-domain data, the model can acquire the potential for text recognition by finetuning on small datasets.

Finetuning with Image Captioning
It is natural to recast text recognition as image captioning, as text recognition also requires the model to generate a piece of text based on the input image. This is equivalent to finetuning on different image captioning datasets, where the target refers to the text on the image. We finetune the model with maximum likelihood estimation for optimization.
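The objective can be sketched as follows: the loss is the negative log-likelihood of the ground-truth transcription under the decoder's per-step token distributions. This is an illustrative stand-in, not the OFA-OCR training code; the dictionaries below replace the decoder's softmax outputs:

```python
import math

def caption_nll(step_probs, target_ids):
    """Maximum likelihood estimation for captioning: sum the negative
    log-probabilities the decoder assigns to each target token.
    step_probs[t] maps token ids to probabilities at decoding step t."""
    return -sum(math.log(step_probs[t][tok])
                for t, tok in enumerate(target_ids))
```

Training then simply minimizes this quantity averaged over the batch, exactly as in standard image captioning.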
Furthermore, to better alleviate the discrepancy between upstream and downstream data, we apply a transformation to the input images to make them square, e.g., at a resolution of 480 × 480. Specifically, we first resize the image so that its longer edge matches the specified resolution while keeping the original height-width ratio, and we then make the image square by padding on all sides with the edge values. The padding lengths in each direction are random, and thus this method can serve as data augmentation in this context. We demonstrate the pseudocode in Sec. A.3.
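A rough sketch of the padding step, operating on a nested-list "image" for clarity (the function name and the use of Python lists instead of tensors are our own illustration; the resize to the target longer edge is assumed to have happened already):

```python
import random

def pad_to_square(img, size, seed=None):
    """Pad a resized H x W grid to size x size by replicating edge
    values. The split of padding between left/right and top/bottom is
    random, so each call places the content differently, which acts as
    data augmentation."""
    rng = random.Random(seed)
    h, w = len(img), len(img[0])
    left = rng.randint(0, size - w)   # random horizontal offset
    top = rng.randint(0, size - h)    # random vertical offset
    out = []
    for r in range(size):
        src = img[min(max(r - top, 0), h - 1)]   # clamp to nearest edge row
        row = [src[0]] * left + src + [src[-1]] * (size - w - left)
        out.append(row)
    return out
```

Because only edge values are replicated, no new pixel intensities are introduced, while the random offsets vary the text position across training epochs.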
For better performance in the downstream tasks, we often use a larger resolution in the finetuning stage, and thus we encounter issues with the positional embeddings. In our practice, we still use the ones from pretraining but apply interpolation to adapt them to images of a larger resolution.
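The idea can be illustrated with a 1-D linear interpolation over a table of positional embeddings; this is a simplified sketch of our own (in practice the interpolation runs over the 2-D grid of image patches, typically bilinearly):

```python
def interpolate_pos_embed(pos, new_len):
    """Linearly interpolate a table of positional embeddings (a list of
    vectors learned at the pretraining resolution) to new_len positions
    for the larger finetuning resolution."""
    old_len, dim = len(pos), len(pos[0])
    out = []
    for i in range(new_len):
        # Map the new position index back onto the old embedding axis.
        x = i * (old_len - 1) / (new_len - 1) if new_len > 1 else 0.0
        lo = int(x)
        hi = min(lo + 1, old_len - 1)
        frac = x - lo
        out.append([(1 - frac) * pos[lo][d] + frac * pos[hi][d]
                    for d in range(dim)])
    return out
```

The pretrained weights are thus reused unchanged; only their assignment to positions is stretched to cover the larger input.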

Multitask Finetuning
There are multiple subtasks in text recognition concerning different scenarios, e.g., scene, document, etc. Our experiments are implemented on the Chinese text recognition benchmark consisting of 4 subtasks. In our practice, we implement multitask finetuning and single-task finetuning for comparison. Specifically, as the data of all subtasks are organized in the same format, we directly build a mixture of datasets for multitask finetuning. We find that directly applying multitask finetuning can help OFA-OCR achieve outstanding performance on all datasets. To further boost its performance, we additionally apply single-task finetuning after multitask finetuning, and we find that this pushes its performance to the new state-of-the-art.
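Because every subtask shares the same (image, text) format, the multitask mixture is simply a shuffled concatenation of the per-scenario datasets; a minimal sketch (the function and dataset names are illustrative, not from the OFA-OCR code):

```python
import random

def build_multitask_mixture(datasets, seed=0):
    """Concatenate per-scenario datasets of (image, text) pairs into a
    single shuffled training set for multitask finetuning."""
    mixture = [ex for name in sorted(datasets) for ex in datasets[name]]
    random.Random(seed).shuffle(mixture)
    return mixture
```

Single-task finetuning then continues training from the multitask checkpoint on one scenario's dataset alone.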

Datasets and Metrics
We implement OFA-OCR on the Chinese text recognition benchmark (Chen et al., 2021b). This benchmark consists of multiple subtasks of text recognition in different scenarios, including scene, web, document, and handwriting. The details of the datasets are provided in Sec. A.1. The evaluation metric is accuracy, which refers to the ratio of exact matches.
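Exact-match accuracy is straightforward to compute: a prediction counts as correct only if it matches the reference transcription character for character.

```python
def exact_match_accuracy(predictions, references):
    """Ratio of predictions that exactly match their reference text."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(predictions)
```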

Experimental Results
The experimental results are demonstrated in Table 1. We compare our method with baseline OCR models, including the previous state-of-the-art MaskOCR (Lyu et al., 2022). It can be found that, regardless of model scale, the base-size OFA-OCR, which is finetuned from the pretrained Chinese OFA-Base, outperforms both the base-size and large-size MaskOCR models. Specifically, it shows absolute improvements of 9.0, 6.9, and 5.3 in the scene, web, and handwriting scenarios, respectively. On average, the base-size OFA-OCR outperforms the base-size MaskOCR by 5.2 and the large-size MaskOCR by 3.4. Scaling up the model size consistently brings steady improvements in downstream performance. On average, OFA-Large reaches the best result of 86.3. Notably, we find that the advantage on the scene dataset is the largest among the tasks. This may be attributed to the pretraining on general-domain data, which contains images of street views, some of which might contain texts. Similarly, the pretraining dataset contains web images that resemble those in the web dataset, and thus the gaps between OFA-OCR and the previous methods are large. However, text recognition for documents should be a simpler task, as the texts are more regular in fonts and there is often much less noise in the background. Thus, even a conventional method like CRNN can achieve high accuracy.

Ablation Study of Training Strategies
To check how multitask learning influences the final performance, we conduct an ablation study to evaluate its effects. Specifically, the experiments are conducted with the base-size OFA-OCR. We provide experiments in 4 setups: training from scratch (scratch), single-task finetuning (ft), multitask finetuning (mt), and multitask + single-task finetuning (mt+ft). Experimental results are shown in Figure 2. It can be found that, on average, initialization from the pretrained OFA model significantly boosts performance on the datasets. Surprisingly, multitask finetuning alone can outperform single-task finetuning on all 4 tasks, and the advantage on the web dataset is the most obvious. We assume that this is attributed to the small amount of supervised training data for downstream transfer: a mixture of datasets of related subtasks can encourage performance on all subtasks. Furthermore, the combination of multitask and single-task finetuning is the best solution owing to its outstanding performance, while multitask finetuning on the mixture of datasets alone is the most cost-efficient.

Ablation Study of Data Augmentation
The preprocessing of images for this task can serve as data augmentation. To validate its effects, we use simple resizing to the specified resolution as a baseline. We implement experiments on the 4 datasets, and for simplicity we run them in the setup of single-task finetuning on the base-size models. Results are demonstrated in Table 2. We use "Aug." to indicate the preprocessing method mentioned in Sec. 2. The results indicate that the introduced data preprocessing technique can effectively boost performance.

Deployment
To construct an OCR system applicable in real-world scenarios, a strong text recognition model alone is not sufficient; we need to build a pipeline with both a text detection module and a text recognition module. As the former is not the focus of this research, we directly use a lightweight model from EasyOCR for detection. After detecting all the bounding boxes that possibly contain texts, we crop the image with these boxes to create a batch of new images. The final step is to process the images with OFA-OCR to generate the text recognition results. Through our case study, we find that this simple OCR pipeline based on OFA-OCR can achieve performance competitive with a product-level API. Examples are demonstrated in Sec. A.4.
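The pipeline can be summarized as detect, crop, recognize. The sketch below wires the stages together with placeholder callables; the names detect, crop, and recognize are our own, standing in for the EasyOCR detector and the OFA-OCR captioning model:

```python
def ocr_pipeline(image, detect, crop, recognize):
    """Two-stage OCR: detect candidate text boxes, crop them into a
    batch of sub-images, then run text recognition on each crop."""
    boxes = detect(image)                          # bounding boxes of text regions
    crops = [crop(image, box) for box in boxes]    # one sub-image per box
    return [recognize(c) for c in crops]           # one transcription per box
```

Keeping the stages behind plain callables means either module can be swapped, e.g., replacing the detector without touching the recognition model.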

Related Work
We focus on the review of text recognition methods and multimodal pretraining. Classical text recognition methods, mostly built on CNNs and RNNs, have demonstrated their effectiveness (Shi et al., 2017a; Luo et al., 2019; Shi et al., 2019; Yu et al., 2020; Li et al., 2019; Fang et al., 2021). Recent methods have turned to the use of the Transformer and achieved improved performance (Atienza, 2021; Li et al., 2021; Zhang et al., 2022; Lyu et al., 2022). However, before this work, we had not witnessed the direct transfer of general-domain vision-language pretrained models to text recognition. Vision-language pretraining has proved a success, as it has leveled up model performance on a series of downstream tasks (Chen et al., 2019; Lu et al., 2019; Radford et al., 2021; Wang et al., 2021), and unified models capable of both understanding and generation have become popular and achieved the best performance (Wang et al., 2022a,b). Yet, there are only a few unified multimodal pretrained models in Chinese (Lin et al., 2021; Wang et al., 2022a).

Conclusion
In this work, we propose OFA-OCR, which recasts text recognition as image captioning and directly transfers a unified multimodal pretrained model to the task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR achieves state-of-the-art performance on the Chinese text recognition benchmark. Furthermore, an OCR pipeline built with OFA-OCR and a simple text localization module achieves performance competitive with a product-level API.

Limitations
This section discusses the limitations of this work to provide more insights for research in this track. Though OFA-OCR achieves high accuracy on multiple text recognition datasets, its computational costs are higher than those of the non-Transformer baselines. In practice, it is difficult to deploy such large models. Thus, in our future work, we will explore how to distill or compress OFA-OCR into a lightweight model with high efficiency.

Ethics Statement
Our method is essentially based on a generation model, and thus the OCR results should be treated as AI-generated content. As the generated results should be aligned with the input, we have not observed deliberately harmful content, e.g., hate speech, bias, etc. However, the model retains such capabilities, which might be triggered. Although the risk of such phenomena is extremely low after finetuning on the public datasets, we still take it into account. In future research, besides focusing on improving downstream performance, we will study how to increase controllability over the generation.

Figure 1: A comparison of performance with and without general vision-language pretraining. On 4 subtasks of text recognition, OFA-OCR with general-domain vision-language pretraining significantly outperforms the baseline without it.

Figure 2: Performance of OFA-OCR in different setups. We validate the model performance on the 4 datasets in the setups of training from scratch (scratch), single-task finetuning (ft), multitask finetuning (mt), and multitask + single-task finetuning (mt+ft). We observe consistent performance growth with the addition of pretrained weight initialization and multitask finetuning.

Figure 3: A case study of different OCR demos. We compare a product-level API (a) with OFA-OCR (b). Through the case study, we find that OFA-OCR can reach competitive performance.

Table 1: Experimental results on the Chinese text recognition benchmark. Results show that the base-size OFA-OCR model can outperform the previous state-of-the-art, and the large-size model achieves the best performance on average.

Table 2: Performance comparison with and without data augmentation for images. The experiments are conducted in the setup of single-task finetuning on the base-size model.