Proceedings of the 5th Celtic Language Technology Workshop

Brian Davis, Theodorus Fransen, Elaine Ui Dhonnchadha, Abigail Walsh (Editors)


Anthology ID:
2025.cltw-1
Month:
January
Year:
2025
Address:
Abu Dhabi [Virtual Workshop]
Venues:
CLTW | WS
SIG:
Publisher:
International Committee on Computational Linguistics
URL:
https://aclanthology.org/2025.cltw-1/
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/2025.cltw-1.pdf

pdf bib
Proceedings of the 5th Celtic Language Technology Workshop
Brian Davis | Theodorus Fransen | Elaine Ui Dhonnchadha | Abigail Walsh

pdf bib
An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text
Adrian Doyle | John P. McCrae

The quantity of Old Irish text which survives in contemporary manuscripts is relatively small by comparison to what is available for well-resourced modern languages. Moreover, as it is a historical language, no more text will ever be generated by native speakers of Old Irish. This makes the text which has survived particularly valuable, and ideally, all of it would be annotated using a single, common annotation standard, thereby ensuring compatibility between text resources. At present, Old Irish text repositories separate words or sub-word morphemes in accordance with different methodologies, and each uses a different style of lexical annotation. This makes it difficult to utilise content from more than any one repository in NLP applications. This paper provides an assessment of distinctions between existing annotated corpora, showing that the primary point of divergence is at the token level. For this reason, this paper also describes a new method for tokenising Old Irish text. This method can be applied even to diplomatic editions, and has already been utilised in various text resources.

pdf bib
Synthesising a Corpus of Gaelic Traditional Narrative with Cross-Lingual Text Expansion
William Lamb | Dongge Han | Ondrej Klejch | Beatrice Alex | Peter Bell

Advances in large language modelling have disproportionately benefited high-resource languages due to their vastly greater training data reserves. This paper proposes a novel cross-lingual text expansion (XLTE) technique using multilingual large language models (MLLMs) to mitigate data sparsity in low-resource languages. We apply XLTE to the domain of traditional Scottish Gaelic storytelling to generate a training corpus suitable for language modelling, for example as part of an automatic speech recognition system. The effectiveness of this technique is demonstrated using OpenAI’s GPT-4o, with supervised fine-tuning (SFT) providing decreased neologism rates and a 57.2% reduction in perplexity over the baseline model. Despite these promising results, qualitative analyses reveal important stylistic divergences between synthesised and genuine data. Nevertheless, XLTE offers a promising, scalable method for synthesising training sets in other languages and domains, opening avenues for further improvements in low-resource language modelling.

pdf bib
A Pragmatic Approach to Using Artificial Intelligence and Virtual Reality in Digital Game-Based Language Learning
Monica Ward | Liang Xu | Elaine Uí Dhonnchadha

Computer-Assisted Language Learning (CALL) applications have many benefits for language learning. However, they can be difficult to develop for low-resource languages such as Irish and the other Celtic languages. It can be difficult to assemble the multidisciplinary team needed to develop CALL resources and there are fewer language resources available for the language. This paper provides an overview of a pragmatic approach to using Artificial Intelligence (AI) and Virtual Reality (VR) in developing a Digital Game-Based Language Learning (DGBLL) app for Irish. This pragmatic approach was used to develop Cipher - a DGBLL app for Irish (Xu et al, 2022b) where a number of existing resources including text repositories and NLP tools were used. In this paper the focus is on the incorporation of Artificial Intelligence (AI) technologies including AI image generation, text-to-speech (TTS) and Virtual Reality (VR), in a pedagogically informed manner to support language learning in a way that is both challenging and enjoyable. Cipher has been designed to be language independent and can be adapted for various cohorts of learners and for other languages. Cipher has been played and tested in a number of schools in Dublin and the feedback from teachers and students has been very positive. This paper outlines how AI and VR technologies have been utilised in Cipher and how it could be adapted to other Celtic languages and low-resource languages in general.

pdf bib
Fotheidil: an Automatic Transcription System for the Irish Language
Liam Lonergan | Ibon Saratxaga | John Sloan | Oscar Maharg Bravo | Mengjie Qian | Neasa Ní Chiaráin | Christer Gobl | Ailbhe Ní Chasaide

This paper sets out the first web-based transcription system for the Irish language - Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. The system includes both off-the-shelf pre-trained voice activity detection and speaker diarisation models and models trained specifically for Irish automatic speech recognition and capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements for out-of-domain test sets and dialects that are underrepresented in the supervised training set. A novel approach to capitalisation and punctuation restoration involving sequence-to-sequence models is compared with the conventional approach using a classification model. Experimental results show here also substantial improvements in performance. It is intended that will be made freely available for public use, and represents an important resource researchers and others who transcribe Irish language materials. Human-corrected transcriptions will be collected and included in the training dataset as the system is used, which should lead to incremental improvements to the ASR model in a cyclical, community-driven fashion.

pdf bib
Gaeilge Bhriste ó Shamhlacha Cliste: How Clever Are LLMs When Translating Irish Text?
Teresa Clifford | Abigail Walsh | Brian Davis | Mícheál J. Ó Meachair

Large Language Models have been widely adopted in NLP tasks and applications, how- ever, their ability to accurately process Irish and other minority languages has not been fully explored. In this paper we describe prelim- inary experiments examining the capacity of publicly-available machine translation engines (Google Translate, Microsoft Bing, and eTrans- lation) and prompt-based AI systems systems (ChatGPT 3.5, Llama 2) for translating and handling challenging language features of Irish. A hand-crafted selection of challenging Irish language features were incorporated into trans- lation prompts, and the output from each model was examined by a human evaluator. The re- sults of these experiments indicate that these LLM-based models still struggle with translat- ing rare linguistic phenomena and ambiguous constructions. This preliminary analysis helps to inform further research in this field, pro- viding a simple ranking of publicly-available models, and indicating which language features require particular attention when evaluating model capacity.