PST 2.0 – Corpus of Polish Spatial Texts

Michał Marcińczuk, Marcin Oleksy, Jan Wieczorek


Abstract
In this paper, we focus on modeling spatial expressions in texts. We present the guidelines used to annotate PST 2.0 (Corpus of Polish Spatial Texts), a corpus designed for training and testing tools for spatial expression recognition. The corpus consists of texts collected from travel blogs available under a Creative Commons license. We defined our guidelines based on three existing specifications for English (SpatialML, Spatial Role Labelling from SemEval-2013 Task 3, and ISO-Space 1.4 from SpaceEval 2014). We briefly present these specifications and discuss the modifications made to adapt the guidelines to the characteristics of the Polish language. We also describe the process of data collection and manual annotation, including the calculation of inter-annotator agreement. Finally, we present detailed statistics of the PST 2.0 corpus, which include the number of components, relations, and expressions, as well as the most common values of spatial indicators, motion indicators, path indicators, distances, directions, and regions.
Anthology ID:
2020.lrec-1.265
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2167–2174
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.265
Cite (ACL):
Michał Marcińczuk, Marcin Oleksy, and Jan Wieczorek. 2020. PST 2.0 – Corpus of Polish Spatial Texts. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2167–2174, Marseille, France. European Language Resources Association.
Cite (Informal):
PST 2.0 – Corpus of Polish Spatial Texts (Marcińczuk et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.265.pdf