Croatian Error-Annotated Corpus of Non-Professional Written Language

Vanja Štefanec, Nikola Ljubešić, Jelena Kuvač Kraljević


Abstract
This paper presents the Croatian corpus of non-professional written language. The corpus consists of two subcorpora: a clinical subcorpus of written texts produced by speakers with various types of language disorders, and a subcorpus of texts produced by healthy speakers. Both its composition and its levels of annotation open up several lines of research. The authors present the corpus structure, describe the sampling methodology, explain the levels of annotation, and give some basic statistics. Using data from the corpus, existing language technologies for Croatian are adapted for use in a platform that facilitates text production for speakers with language disorders. To this end, several analyses of the corpus data and a basic evaluation of the developed technologies are presented.
Anthology ID:
L16-1513
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3220–3226
URL:
https://aclanthology.org/L16-1513
Cite (ACL):
Vanja Štefanec, Nikola Ljubešić, and Jelena Kuvač Kraljević. 2016. Croatian Error-Annotated Corpus of Non-Professional Written Language. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3220–3226, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Croatian Error-Annotated Corpus of Non-Professional Written Language (Štefanec et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1513.pdf