TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics

Andy Luecking; Armin Hoenen; Alexander Mehler

Abstract

TGermaCorp is a German text corpus whose primary sources are drawn from German literary texts dating from the sixteenth century to the present. The corpus is intended to represent its target language (German) in its syntactic, lexical, stylistic and chronological diversity. For this purpose, it is hand-annotated on several linguistic layers, including POS, lemma, named entities, multiword expressions, clauses, sentences and paragraphs. To situate TGermaCorp relative to more homogeneous corpora of contemporary everyday language, quantitative assessments of syntactic and lexical diversity are provided. In this respect, TGermaCorp contributes to establishing characteristic features for resource descriptions, which is needed for meaningfully comparing the ever-growing number of natural language resources. The assessments confirm the special role of proper names, whose propagation in text may influence lexical and syntactic diversity measures in rather trivial ways. TGermaCorp will be made available via hucompute.org.

Anthology ID: L16-1677
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4271–4277
URL: https://aclanthology.org/L16-1677/
Cite (ACL): Andy Luecking, Armin Hoenen, and Alexander Mehler. 2016. TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4271–4277, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics (Luecking et al., LREC 2016)
PDF: https://aclanthology.org/L16-1677.pdf