Towards the Annotation of Named Entities in the National Corpus of Polish

Agata Savary; Jakub Waszczuk; Adam Przepiórkowski

Abstract

We present the named entity annotation task within the ongoing project of the National Corpus of Polish. To the best of our knowledge, this is the first attempt at a large-scale corpus annotation of Polish named entities. We describe the scope and the TEI-inspired hierarchy of named entities admitted for this task, as well as the TEI-conformant multi-level stand-off annotation format. We also discuss some methodological strategies, including the annotation of embedded, coordinated and discontinuous names. Our annotation platform consists of two main tools interconnected by conversion facilities. SProUT, a rule-based natural language processing platform, is used for the automatic pre-annotation of named entities, relying on previously created Polish extraction grammars that were adapted to the annotation task. TrEd, a customizable graphical tree editor extended to suit our needs, provides an ergonomic environment for manual correction of annotations. Despite some difficult cases encountered in the early annotation phase, about 2,600 named entities in 1,800 corpus sentences have been annotated so far, which has allowed us to validate the project methodology and tools.

Anthology ID: L10-1605
Volume: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month: May
Year: 2010
Address: Valletta, Malta
Editors: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/879_Paper.pdf
Cite (ACL): Agata Savary, Jakub Waszczuk, and Adam Przepiórkowski. 2010. Towards the Annotation of Named Entities in the National Corpus of Polish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal): Towards the Annotation of Named Entities in the National Corpus of Polish (Savary et al., LREC 2010)
PDF: http://www.lrec-conf.org/proceedings/lrec2010/pdf/879_Paper.pdf
