The siParl corpus of Slovene parliamentary proceedings

Andrej Pancur, Tomaž Erjavec


Abstract
The paper describes the acquisition, up-translation, encoding, annotation, and distribution of siParl, a collection of parliamentary debates from the Assembly of the Republic of Slovenia from 1990 to 2018, covering the period from just before Slovenia became an independent country in 1991 almost up to the present. The entire corpus, comprising over 8 thousand sessions, 1 million speeches and 200 million words, was uniformly encoded in accordance with the TEI-based Parla-CLARIN schema for encoding corpora of parliamentary debates. It contains extensive metadata about the speakers, a typology of sessions, etc., as well as structural and editorial annotations. The corpus was also part-of-speech tagged and lemmatised using state-of-the-art tools. The corpus is maintained on GitHub, with its major versions archived in the CLARIN.SI repository, and is available for linguistic analysis through the online CLARIN.SI concordancers, thus offering an invaluable resource for scholars studying Slovenian political history.
Anthology ID:
2020.parlaclarin-1.6
Volume:
Proceedings of the Second ParlaCLARIN Workshop
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Darja Fišer, Maria Eskevich, Franciska de Jong
Venue:
ParlaCLARIN
Publisher:
European Language Resources Association
Pages:
28–34
Language:
English
URL:
https://aclanthology.org/2020.parlaclarin-1.6
Cite (ACL):
Andrej Pancur and Tomaž Erjavec. 2020. The siParl corpus of Slovene parliamentary proceedings. In Proceedings of the Second ParlaCLARIN Workshop, pages 28–34, Marseille, France. European Language Resources Association.
Cite (Informal):
The siParl corpus of Slovene parliamentary proceedings (Pancur & Erjavec, ParlaCLARIN 2020)
PDF:
https://aclanthology.org/2020.parlaclarin-1.6.pdf