LOCR: Location-Guided Transformer for Optical Character Recognition

Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen Zhong


Abstract
Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents.To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on an original large-scale dataset comprising over 53M text-location pairs from 89K academic document pages, including bounding boxes for words, tables and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods in our test set constructed from arXiv.LOCR also eliminates repetition in the arXiv dataset, and reduces repetition frequency in OOD documents, from 13.19% to 0.04% for natural science documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from human.
Anthology ID:
2024.findings-emnlp.314
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5480–5497
Language:
URL:
https://aclanthology.org/2024.findings-emnlp.314
DOI:
Bibkey:
Cite (ACL):
Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, and Han-Sen Zhong. 2024. LOCR: Location-Guided Transformer for Optical Character Recognition. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5480–5497, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
LOCR: Location-Guided Transformer for Optical Character Recognition (Sun et al., Findings 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.findings-emnlp.314.pdf