Collection and Annotation of the Romanian Legal Corpus

Dan Tufiș, Maria Mitrofan, Vasile Păiș, Radu Ion, Andrei Coman


Abstract
We present the Romanian legislative corpus, a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how legal terminology is used and how it can be made more consistent. The corpus currently contains more than 140k documents representing the legislative body of Romania. It is processed and annotated at several levels: it is tokenized, lemmatized, POS-tagged, dependency parsed, and chunked; named entities are identified; and the text is labeled with IATE terms and EUROVOC descriptors. Each annotated document is in CoNLL-U Plus format with 14 columns: the 10 columns of the standard format plus four columns for the additional annotation types. Moreover, the repository will be updated periodically as new legislative texts are published; these will be collected automatically and passed to the processing and annotation pipeline. The corpus will be accessible through the ELRC infrastructure.
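The 14-column CoNLL-U Plus layout mentioned in the abstract can be read with a few lines of Python. The sketch below is illustrative only and is not the authors' tooling. It relies on the CoNLL-U Plus convention that a file begins with a "# global.columns = ..." line naming its columns. The four extra column names (EX:NE, EX:CHUNK, EX:IATE, EX:EUROVOC) and the two sample tokens are hypothetical placeholders, since the abstract does not name the columns or show any data.

```python
import io

# The 10 standard CoNLL-U columns.
STANDARD = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
# Hypothetical names for the four extra annotation columns (assumption:
# the abstract does not give the real column names).
EXTRA = ["EX:NE", "EX:CHUNK", "EX:IATE", "EX:EUROVOC"]


def read_conllu_plus(stream):
    """Yield sentences as lists of {column_name: value} dicts."""
    columns = None
    sentence = []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names are whitespace-separated after the '='.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            continue  # other comment lines (sentence ids, text, ...)
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, got {len(fields)}")
        sentence.append(dict(zip(columns, fields)))
    if sentence:
        yield sentence


# Invented two-token sample, not taken from the corpus.
rows = [
    ["1", "Parlamentul", "parlament", "NOUN", "_", "_", "2", "nsubj",
     "_", "_", "B-ORG", "_", "_", "_"],
    ["2", "adoptă", "adopta", "VERB", "_", "_", "0", "root",
     "_", "_", "O", "_", "_", "_"],
]
SAMPLE = ("# global.columns = " + " ".join(STANDARD + EXTRA) + "\n"
          + "\n".join("\t".join(r) for r in rows) + "\n\n")

sentences = list(read_conllu_plus(io.StringIO(SAMPLE)))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["EX:NE"])
```

Reading the column names from the header, rather than hard-coding them, means the same function would work unchanged once the real names of the four extra columns are known.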
Anthology ID:
2020.lrec-1.337
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2773–2777
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.337
Cite (ACL):
Dan Tufiș, Maria Mitrofan, Vasile Păiș, Radu Ion, and Andrei Coman. 2020. Collection and Annotation of the Romanian Legal Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2773–2777, Marseille, France. European Language Resources Association.
Cite (Informal):
Collection and Annotation of the Romanian Legal Corpus (Tufiș et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.337.pdf