corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora

Stephan Druskat, Volker Gast, Thomas Krause, Florian Zipser


Abstract
This paper introduces an open-source, interoperable, generic software tool set that covers the entire workflow of creating, migrating, annotating, querying and analyzing multi-layer linguistic corpora. It consists of four components: Salt, a graph-based meta model and API for linguistic data, which serves as the common data model for the rest of the tool set; Pepper, a conversion tool and platform that can convert many different linguistic formats into each other; Atomic, an extensible, platform-independent desktop annotation tool for multi-layer linguistic corpora; and ANNIS, a search and visualization architecture for multi-layer linguistic corpora with many different visualizations and a powerful native query language. The tool set was designed to address the following requirements of a multi-layer corpus workflow: lossless data transition between tools, through a common data model generic enough to allow for a potentially unlimited number of annotation types; conversion capabilities for different linguistic formats, to support processing data from different sources and/or with existing annotations; a high level of extensibility, to enhance the sustainability of the whole tool set; and analysis capabilities encompassing corpus and annotation query alongside multi-faceted visualizations of all annotation layers.
Anthology ID:
L16-1711
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4492–4499
URL:
https://aclanthology.org/L16-1711
Cite (ACL):
Stephan Druskat, Volker Gast, Thomas Krause, and Florian Zipser. 2016. corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4492–4499, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora (Druskat et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1711.pdf