Polish Coreference Corpus in Numbers

Maciej Ogrodniczuk; Mateusz Kopeć; Agata Savary

Abstract

This paper attempts a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually annotated Polish Coreference Corpus by analysing various statistical properties of mentions, clusters and near-identity links. Among others, the frequencies of mentions, zero subjects and singleton clusters are presented, as well as the average mention and cluster size. We also show that some coreference clustering constraints, such as gender or number agreement, frequently do not hold in the case of Polish. The need for lemmatization in automatic coreference resolution is supported by an empirical study. The correlation between cluster and mention counts within a text is investigated, with a short characterization of outlier cases. We also examine this correlation in each of the 14 text domains present in the corpus and show that none of them has an abnormal frequency of outlier texts with respect to the cluster/mention ratio. Finally, we report on our negative experiences concerning the annotation of the near-identity relation. In the conclusion we put forward some guidelines for future research in the area.

Anthology ID: L14-1066
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3234–3238
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1088_Paper.pdf
Cite (ACL): Maciej Ogrodniczuk, Mateusz Kopeć, and Agata Savary. 2014. Polish Coreference Corpus in Numbers. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3234–3238, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Polish Coreference Corpus in Numbers (Ogrodniczuk et al., LREC 2014)
