Corpus Services: A Framework to Curate XML Corpus Data

Aleksandr Riaposov, Elena Lazarenko


Abstract
This paper provides a comprehensive description of the Corpus Services framework—a collection of Java validation tools for language corpora compiled in XML-based data formats, in particular those using EXMARaLDA corpus software. Having successfully found application in several research projects, the core functionality of the framework is currently integrated in the automated curation and publication workflows for EXMARaLDA-driven corpora of Northern Eurasian languages, as developed by the long-term project INEL. Preliminary stages of development and examples of practical use cases are covered, a structured explanation of the framework’s current functionality and operational mechanisms is provided. Furthermore, the utilization of Corpus Services is extensively illustrated within the context of INEL workflows.
Anthology ID:
2024.lrec-main.358
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
4030–4035
Language:
URL:
https://aclanthology.org/2024.lrec-main.358
DOI:
Bibkey:
Cite (ACL):
Aleksandr Riaposov and Elena Lazarenko. 2024. Corpus Services: A Framework to Curate XML Corpus Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4030–4035, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Corpus Services: A Framework to Curate XML Corpus Data (Riaposov & Lazarenko, LREC-COLING 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.lrec-main.358.pdf