Introducing the National Corpus of Irish Project

Mícheál Ó Meachair; Úna Bhreathnach; Gearóid Ó Cleircín

Abstract

This paper introduces the National Corpus of Irish, an initiative to develop a large national corpus of written and spoken contemporary Irish as well as related specialised corpora. The newly compiled corpora will be hosted at corpas.ie, which will become a hub for corpus-based research on the Irish language. Users will be able to search the corpora and download data generated during the project from the corpas.ie website and from appropriate third-party repositories. Corpus 1 will be a balanced general-purpose corpus containing c.155m words. Corpus 2 will be a written corpus consisting of c.100m words. Corpus 3 will be a spoken corpus containing 6.5m words. Corpus 4 will be a monitor corpus with a target size of 1m words per year from 2000 onwards. Token, lemma, and n-gram frequency lists will be published at regular intervals on the project website, and language models will be published there and on other appropriate platforms during the course of the project. This paper focuses on the background and the crucial scoping stage of the project, and examines user needs as identified in a survey of potential users.

Anthology ID: 2022.cltw-1.14
Volume: Proceedings of the 4th Celtic Language Technology Workshop within LREC2022
Month: June
Year: 2022
Address: Marseille, France
Editors: Theodorus Fransen, William Lamb, Delyth Prys
Venue: CLTW
Publisher: European Language Resources Association
Pages: 99–103
URL: https://aclanthology.org/2022.cltw-1.14/
Cite (ACL): Mícheál Ó Meachair, Úna Bhreathnach, and Gearóid Ó Cleircín. 2022. Introducing the National Corpus of Irish Project. In Proceedings of the 4th Celtic Language Technology Workshop within LREC2022, pages 99–103, Marseille, France. European Language Resources Association.
Cite (Informal): Introducing the National Corpus of Irish Project (Ó Meachair et al., CLTW 2022)
PDF: https://aclanthology.org/2022.cltw-1.14.pdf
