Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC

Erhard Hinrichs, Thomas Zastrow


Abstract
This paper presents the Tübingen Baumbank des Deutschen Diachron (TüBa-D/DC), a linguistically annotated corpus of selected diachronic materials from the German Gutenberg Project. It was automatically annotated by a suite of NLP tools integrated into WebLicht, the linguistic chaining tool used in CLARIN-D. The annotation quality has been evaluated manually for a subcorpus ranging from Middle High German to Modern High German. The integration of the TüBa-D/DC into the CLARIN-D infrastructure includes metadata provision and harvesting as well as sustainable data storage in the Tübingen CLARIN-D center. The paper further provides an overview of the possibilities of accessing the TüBa-D/DC data. Methods for full-text search of the metadata and object data and for annotation-based search of the object data are described in detail. The WebLicht Service-Oriented Architecture is used as an integrated environment for annotation-based search of the TüBa-D/DC. WebLicht thus serves not only as the annotation platform for the TüBa-D/DC, but also as a generic user interface for accessing and visualizing it.
Anthology ID:
L12-1033
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1622–1627
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/166_Paper.pdf
Cite (ACL):
Erhard Hinrichs and Thomas Zastrow. 2012. Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1622–1627, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC (Hinrichs & Zastrow, LREC 2012)