GenitivDB — a Corpus-Generated Database for German Genitive Classification

Roman Schneider

Abstract

We present a novel NLP resource for the explanation of linguistic phenomena, built and evaluated by exploring very large annotated language corpora. For the compilation, we use the German Reference Corpus (DeReKo) with more than 5 billion word forms, which is the largest linguistic resource worldwide for the study of contemporary written German. The result is a comprehensive database of German genitive formations, enriched with a broad range of intra- and extralinguistic metadata. It can be used for the notoriously controversial classification and prediction of genitive endings (short endings, long endings, zero-marker). We also evaluate the main factors influencing the use of specific endings. To get a general idea of a factor's influence and its side effects, we calculate chi-square tests and visualize the residuals with an association plot. The results are evaluated against a gold standard by implementing tree-based machine learning algorithms. For the statistical analysis, we applied the supervised Logistic Model Trees (LMT) algorithm, using the WEKA software. We intend to use this gold standard to evaluate GenitivDB, as well as to explore methodologies for a predictive genitive model.
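The chi-square step mentioned in the abstract can be illustrated with a minimal sketch. It computes the test statistic and the Pearson residuals that an association plot would display. The counts, the factor (a two-level noun property), and the function name `chi_square_residuals` are hypothetical and chosen only for illustration; they are not taken from GenitivDB or the paper.

```python
from math import sqrt


def chi_square_residuals(table):
    """Return (statistic, degrees of freedom, Pearson residuals) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    # Pearson residual: (observed - expected) / sqrt(expected).
    residuals = [
        [(obs - exp) / sqrt(exp) for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]

    # The chi-square statistic is the sum of the squared Pearson residuals.
    statistic = sum(res ** 2 for row in residuals for res in row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, residuals


# Hypothetical counts (NOT real corpus data).
# Rows: two levels of an assumed factor; columns: short ending, long ending, zero-marker.
counts = [
    [30, 10, 10],
    [10, 30, 10],
]

statistic, dof, residuals = chi_square_residuals(counts)
print(f"chi-square = {statistic:.2f}, dof = {dof}")
for row in residuals:
    print([round(res, 2) for res in row])
```

Cells with large positive residuals are over-represented relative to independence and cells with large negative residuals are under-represented, which is the information an association plot encodes in the height and direction of its bars.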

Anthology ID: L14-1304
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 988–994
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/346_Paper.pdf
Cite (ACL): Roman Schneider. 2014. GenitivDB — a Corpus-Generated Database for German Genitive Classification. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 988–994, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): GenitivDB — a Corpus-Generated Database for German Genitive Classification (Schneider, LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/346_Paper.pdf
