The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language

Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, Tiberiu Boroș


Abstract
The article describes the current status of a large national project, CoRoLa, which aims to build a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. The targets are 500 million tokens for the written component and 300 hours of recordings for the oral one. Texts are selected according to their functional style, domain and subdomain, with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and converted into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches and, where the protocols signed with text providers allow it, the original files.
Anthology ID:
L16-1399
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2516–2521
URL:
https://aclanthology.org/L16-1399
Cite (ACL):
Dan Tufiș, Verginica Barbu Mititelu, Elena Irimia, Ștefan Daniel Dumitrescu, and Tiberiu Boroș. 2016. The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2516–2521, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language (Tufiș et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1399.pdf