Handwritten Text Recognition (HTR) for Irish-Language Folklore

Brian Ó Raghallaigh, Andrea Palandri, Críostóir Mac Cárthaigh


Abstract
In this paper we present our method for digitising a large collection of handwritten Irish-language texts as part of a project to mine information from a large corpus of Irish and Scottish Gaelic folktales. The handwritten texts form part of the Main Manuscript Collection of the National Folklore Collection of Ireland and contain transcriptions of oral folklore collected in Ireland in the 20th century. With the goal of creating a large text corpus of the Irish-language folktales contained within this collection, our method involves scanning the pages of the physical volumes and digitising the text on these pages using Transkribus, a platform for the recognition of historical documents. Given the nature of the collection, we create individual text recognition models for multiple collectors’ hands. This approach was motivated by the fact that a relatively small number of collectors contributed the bulk of the material, while the differences between collectors in terms of style, layout and orthography were difficult to reconcile within a single handwriting model. We present our preliminary results along with a discussion of the viability of using crowdsourced correction to improve our HTR models.
Anthology ID:
2022.cltw-1.17
Volume:
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Theodorus Fransen, William Lamb, Delyth Prys
Venue:
CLTW
Publisher:
European Language Resources Association
Pages:
121–126
URL:
https://aclanthology.org/2022.cltw-1.17
Cite (ACL):
Brian Ó Raghallaigh, Andrea Palandri, and Críostóir Mac Cárthaigh. 2022. Handwritten Text Recognition (HTR) for Irish-Language Folklore. In Proceedings of the 4th Celtic Language Technology Workshop within LREC2022, pages 121–126, Marseille, France. European Language Resources Association.
Cite (Informal):
Handwritten Text Recognition (HTR) for Irish-Language Folklore (Ó Raghallaigh et al., CLTW 2022)
PDF:
https://aclanthology.org/2022.cltw-1.17.pdf