A tailored Handwritten-Text-Recognition System for Medieval Latin

Philipp Koch; Gilary Vera Nuñez; Esteban Garces Arias; Christian Heumann; Matthias Schöffel; Alexander Häberlin; Matthias Aßenmacher

Abstract

The Bavarian Academy of Sciences and Humanities aims to digitize the Medieval Latin Dictionary. The dictionary comprises record cards that refer to lemmas in medieval Latin, a low-resource language. A crucial step in the digitization process is handwritten text recognition (HTR) of the handwritten lemmas on the record cards. In this work, we introduce an end-to-end pipeline, tailored to the Medieval Latin Dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art image segmentation models to prepare the initial data set for the HTR task. We then conduct a set of experiments with transformer-based models to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. We also apply extensive data augmentation, resulting in a highly competitive model. The best-performing setup achieves a character error rate of 0.015, outperforming the commercial Google Cloud Vision model while also showing more stable performance.
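
The character error rate (CER) reported above is conventionally the number of character-level edits (insertions, deletions, substitutions) needed to turn a prediction into its reference, divided by the number of reference characters. The following is a minimal sketch of that conventional definition, not the authors' evaluation code: the function names are hypothetical, and aggregating edits over the whole corpus (rather than averaging per-lemma rates) is an assumption, since the abstract does not specify it.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (unit cost per edit)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters of a to reach the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         predictions: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: edits are pooled over all samples before dividing.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, p)
                      for r, p in zip(references, predictions))
    return total_edits / total_chars
```

Under this definition, a CER of 0.015 corresponds to roughly 1.5 character edits per 100 reference characters.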

Anthology ID: 2023.alp-1.12
Volume: Proceedings of the Ancient Language Processing Workshop
Month: September
Year: 2023
Address: Varna, Bulgaria
Editors: Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti
Venues: ALP | WS
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 103–110
URL: https://aclanthology.org/2023.alp-1.12/
Cite (ACL): Philipp Koch, Gilary Vera Nuñez, Esteban Garces Arias, Christian Heumann, Matthias Schöffel, Alexander Häberlin, and Matthias Assenmacher. 2023. A tailored Handwritten-Text-Recognition System for Medieval Latin. In Proceedings of the Ancient Language Processing Workshop, pages 103–110, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): A tailored Handwritten-Text-Recognition System for Medieval Latin (Koch et al., ALP 2023)
PDF: https://aclanthology.org/2023.alp-1.12.pdf