An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text

Adrian Doyle; John Philip McCrae

Abstract

The quantity of Old Irish text which survives in contemporary manuscripts is relatively small compared with what is available for well-resourced modern languages. Moreover, as it is a historical language, no more text will ever be generated by native speakers of Old Irish. This makes the surviving text particularly valuable, and ideally all of it would be annotated using a single, common annotation standard, thereby ensuring compatibility between text resources. At present, Old Irish text repositories separate words or sub-word morphemes in accordance with different methodologies, and each uses a different style of lexical annotation. This makes it difficult to utilise content from more than one repository in NLP applications. This paper provides an assessment of distinctions between existing annotated corpora, showing that the primary point of divergence is at the token level. For this reason, this paper also describes a new method for tokenising Old Irish text. This method can be applied even to diplomatic editions, and has already been utilised in various text resources.

Anthology ID: 2025.cltw-1.1
Volume: Proceedings of the 5th Celtic Language Technology Workshop
Month: January
Year: 2025
Address: Abu Dhabi [Virtual Workshop]
Editors: Brian Davis, Theodorus Fransen, Elaine Uí Dhonnchadha, Abigail Walsh
Venues: CLTW | WS
Publisher: International Committee on Computational Linguistics
Pages: 1–11
URL: https://aclanthology.org/2025.cltw-1.1/
Cite (ACL): Adrian Doyle and John P. McCrae. 2025. An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text. In Proceedings of the 5th Celtic Language Technology Workshop, pages 1–11, Abu Dhabi [Virtual Workshop]. International Committee on Computational Linguistics.
Cite (Informal): An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text (Doyle & McCrae, CLTW 2025)
PDF: https://aclanthology.org/2025.cltw-1.1.pdf