Nicholas Wolf


2024

pdf bib
“To Have the ‘Million’ Readers Yet”: Building a Digitally Enhanced Edition of the Bilingual Irish-English Newspaper an Gaodhal (1881-1898)
Oksana Dereza | Deirdre Ní Chonghaile | Nicholas Wolf
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

This paper introduces the ‘An Gaodhal’ project, which aims to serve the historically under-resourced and endangered language of Irish (known as Gaeilge) by providing new digital tools and resources. The initial goal of the project was the extraction of full text of ‘An Gaodhal’, a monthly bilingual Irish-English newspaper produced from 1881 to 1898, to the highest possible degree of accuracy via Optical Character Recognition (OCR), with a view to making its printed content searchable. The methodology applied toward achieving this goal yielded additional digital outputs including: 1. a new OCR model for the Irish language as printed in Cló Gaelach type; 2. a new OCR model for bilingual Irish-English content printed in Cló Gaelach and Roman types respectively; 3. a BART-based OCR post-correction model for historical bilingual Irish-English data; 4. a historical Irish training set for Named Entity Recognition (NER). All but the first of these four additional outputs appear to be the first of their kind. Each of the project outputs, including the full-text OCR outputs in ALTO XML format, is set for public release to enable open-access research. The paper also identifies the challenges historical Irish data poses to Natural Language Processing (NLP) in general and OCR in particular, and reports on project results and outputs to date. Finally, it contextualises the project within the wider field of NLP and considers its potential impact on under-resourced languages worldwide.