Efficient OCR for Building a Diverse Digital History

Jacob Carlson, Tom Bryan, Melissa Dell


Abstract
Many users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) – which jointly learns a vision and language model – is poorly extensible to low-resource document collections, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem, using a contrastively trained vision encoder. Because the model only learns characters’ visual features, it is more sample-efficient and extensible than existing architectures, enabling accurate OCR in settings where existing solutions fail. Crucially, it opens new avenues for community engagement in making digital history more representative of documentary history.
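To make the retrieval framing in the abstract concrete, here is a minimal sketch of recognition as nearest-neighbor lookup over character embeddings. It is an illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors are toy stand-ins for the output of a contrastively trained vision encoder, and the names `recognize`, `cosine`, `index`, and `crops` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def recognize(crop_embeddings, index):
    """Map each character-crop embedding to the nearest character in the index.

    `index` is a list of (character, reference_embedding) pairs.
    """
    chars = []
    for query in crop_embeddings:
        best_char, _ = max(index, key=lambda item: cosine(query, item[1]))
        chars.append(best_char)
    return "".join(chars)


# Toy reference index (assumption: a real index would hold encoder embeddings
# of rendered or labeled character images).
index = [
    ("a", [1.0, 0.0, 0.0]),
    ("b", [0.0, 1.0, 0.0]),
    ("c", [0.0, 0.0, 1.0]),
]

# Toy embeddings for three character crops taken from a line of text.
crops = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.0, 0.2, 0.7]]

print(recognize(crops, index))  # prints "bac"
```

In this framing, supporting a new character amounts to appending one more (character, embedding) pair to the index, which is one way to read the extensibility the abstract describes.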
Anthology ID:
2024.acl-long.440
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8105–8115
URL:
https://aclanthology.org/2024.acl-long.440/
DOI:
10.18653/v1/2024.acl-long.440
Cite (ACL):
Jacob Carlson, Tom Bryan, and Melissa Dell. 2024. Efficient OCR for Building a Diverse Digital History. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8105–8115, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Efficient OCR for Building a Diverse Digital History (Carlson et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.440.pdf