Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources

Tamás Váradi; Bence Nyéki; Svetla Koeva; Marko Tadić; Vanja Štefanec; Maciej Ogrodniczuk; Bartłomiej Nitoń; Piotr Pęzik; Verginica Barbu Mititelu; Elena Irimia; Maria Mitrofan; Dan Tufiş; Radovan Garabík; Simon Krek; Andraž Repar

Abstract

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus comprises seven monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from the respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and their named entities are annotated. The annotations are provided uniformly for each language-specific corpus, and the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The files are in the CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora, as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
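As a rough illustration of the file layout described above, the following minimal Python sketch reads a CoNLL-U Plus file by taking the column names from its `# global.columns` header line. The function name `read_conllu_plus`, the extra column names `EX:NE`, `EX:NP` and `EX:IATE`, and the sample annotation values are placeholders invented for this example; the actual names and value scheme of the three CURLICAT-specific columns should be taken from the corpus documentation.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts.

    Column names come from the '# global.columns = ...' header that a
    CoNLL-U Plus file declares before its first sentence.
    """
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            continue  # other comment lines (sentence ids, metadata)
        if not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
        sentence.append(dict(zip(columns, values)))
    if sentence:  # file did not end with a blank line
        yield sentence


# Hypothetical sample: ten CoNLL-U columns plus three placeholder extras.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "EX:NE EX:NP EX:IATE\n"
    "# sent_id = demo-1\n"
    "1\tEuropean\tEuropean\tADJ\t_\t_\t2\tamod\t_\t_\tB-ORG\tB-NP\t1\n"
    "2\tCommission\tCommission\tPROPN\t_\t_\t0\troot\t_\t_\tI-ORG\tI-NP\t1\n"
    "\n"
)

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
```

Reading the column names from the header rather than hard-coding them keeps the sketch independent of the exact names of the three extra columns.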

Anthology ID: 2022.lrec-1.11
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 100–108
URL: https://aclanthology.org/2022.lrec-1.11/
Cite (ACL): Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Dan Tufiș, Radovan Garabík, Simon Krek, and Andraž Repar. 2022. Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 100–108, Marseille, France. European Language Resources Association.
Cite (Informal): Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources (Váradi et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.11.pdf
