Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker

Delyth Prys; Gruffudd Prys; Dewi Bryn Jones

Abstract

This paper describes the use of a free on-line spelling and grammar checker as a vehicle for collecting a significant corpus of text (31 million words and rising) for academic research. The approach is aimed at less-resourced languages, where such data are often unavailable in sufficient quantities. The paper describes two versions of the corpus: the texts as submitted, prior to the correction process, and the texts following the user’s incorporation of any suggested changes. An overview of the corpus’s contents is given, together with an analysis of use that includes usage statistics. Issues surrounding privacy and the anonymization of data are explored, as is the data’s potential use for linguistic analysis, lexical research and language modelling. The method used for gathering this corpus is believed to be unique and is a valuable addition to corpus studies in a minority language.

Anthology ID: L16-1519
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3261–3264
URL: https://aclanthology.org/L16-1519/
Cite (ACL): Delyth Prys, Gruffudd Prys, and Dewi Bryn Jones. 2016. Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3261–3264, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker (Prys et al., LREC 2016)
PDF: https://aclanthology.org/L16-1519.pdf
