German Parliamentary Corpus (GerParCor)

Giuseppe Abrami; Mevlüt Bagci; Leon Hammerla; Alexander Mehler

Abstract

Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a lack of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpus (GerParCor). GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state and federal level data. In addition, GerParCor contains conversions of scanned protocols and, in particular, of protocols in Fraktur converted via an OCR process based on Tesseract. All protocols were preprocessed with the spaCy 3 NLP pipeline and automatically annotated with metadata regarding their session date. GerParCor is made available in the XMI format of the UIMA project. In this way, GerParCor can be used as a large corpus of historical texts in the field of political communication for various tasks in NLP.

Anthology ID: 2022.lrec-1.202
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1900–1906
URL: https://aclanthology.org/2022.lrec-1.202/
Cite (ACL): Giuseppe Abrami, Mevlüt Bagci, Leon Hammerla, and Alexander Mehler. 2022. German Parliamentary Corpus (GerParCor). In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1900–1906, Marseille, France. European Language Resources Association.
Cite (Informal): German Parliamentary Corpus (GerParCor) (Abrami et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.202.pdf
