Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin; Xuancheng Ren; Yichang Zhang; Gao Liu; Peng Wang; An Yang; Chang Zhou

doi:10.18653/v1/2023.findings-acl.37

Abstract

This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.

Anthology ID: 2023.findings-acl.37
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 588–597
URL: https://aclanthology.org/2023.findings-acl.37/
DOI: 10.18653/v1/2023.findings-acl.37
Bibkey: lin-etal-2023-transferring
Cite (ACL): Junyang Lin, Xuancheng Ren, Yichang Zhang, Gao Liu, Peng Wang, An Yang, and Chang Zhou. 2023. Transferring General Multimodal Pretrained Models to Text Recognition. In Findings of the Association for Computational Linguistics: ACL 2023, pages 588–597, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Transferring General Multimodal Pretrained Models to Text Recognition (Lin et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.37.pdf

BibTeX

@inproceedings{lin-etal-2023-transferring,
    title = "Transferring General Multimodal Pretrained Models to Text Recognition",
    author = "Lin, Junyang  and
      Ren, Xuancheng  and
      Zhang, Yichang  and
      Liu, Gao  and
      Wang, Peng  and
      Yang, An  and
      Zhou, Chang",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.37/",
    doi = "10.18653/v1/2023.findings-acl.37",
    pages = "588--597",
    abstract = "This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="lin-etal-2023-transferring">
    <titleInfo>
        <title>Transferring General Multimodal Pretrained Models to Text Recognition</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Junyang</namePart>
        <namePart type="family">Lin</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Xuancheng</namePart>
        <namePart type="family">Ren</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yichang</namePart>
        <namePart type="family">Zhang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Gao</namePart>
        <namePart type="family">Liu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Peng</namePart>
        <namePart type="family">Wang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">An</namePart>
        <namePart type="family">Yang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chang</namePart>
        <namePart type="family">Zhou</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: ACL 2023</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Anna</namePart>
            <namePart type="family">Rogers</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jordan</namePart>
            <namePart type="family">Boyd-Graber</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Naoaki</namePart>
            <namePart type="family">Okazaki</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Toronto, Canada</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.</abstract>
    <identifier type="citekey">lin-etal-2023-transferring</identifier>
    <identifier type="doi">10.18653/v1/2023.findings-acl.37</identifier>
    <location>
        <url>https://aclanthology.org/2023.findings-acl.37/</url>
    </location>
    <part>
        <date>2023-07</date>
        <extent unit="page">
            <start>588</start>
            <end>597</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Transferring General Multimodal Pretrained Models to Text Recognition
%A Lin, Junyang
%A Ren, Xuancheng
%A Zhang, Yichang
%A Liu, Gao
%A Wang, Peng
%A Yang, An
%A Zhou, Chang
%Y Rogers, Anna
%Y Boyd-Graber, Jordan
%Y Okazaki, Naoaki
%S Findings of the Association for Computational Linguistics: ACL 2023
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F lin-etal-2023-transferring
%X This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API.
%R 10.18653/v1/2023.findings-acl.37
%U https://aclanthology.org/2023.findings-acl.37/
%U https://doi.org/10.18653/v1/2023.findings-acl.37
%P 588-597

Markdown (Informal)

[Transferring General Multimodal Pretrained Models to Text Recognition](https://aclanthology.org/2023.findings-acl.37/) (Lin et al., Findings 2023)
