PoliTa: A multitagger for Polish

Łukasz Kobyliński


Abstract
Part-of-Speech (POS) tagging is a crucial task in Natural Language Processing (NLP). POS tags may be assigned to tokens in text manually, by trained linguists, or automatically, using algorithmic approaches. In annotated text corpora in particular, the volume of text makes manual tagging infeasible, so automated methods are used extensively. The quality of such methods is of critical importance, as even a 1% tagger error rate introduces millions of errors into a corpus of a billion tokens. For Polish, several POS taggers have been proposed to date, but even the best of them achieves an accuracy of ca. 93%, as measured on the one-million-word subcorpus of the National Corpus of Polish (NCP). Since tagging is a classification task, this article introduces a new POS tagger for Polish that combines several classifiers to produce higher-quality tagging results than any of the individual taggers.
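
As a rough illustration of combining several taggers, the sketch below applies simple per-token majority voting to the outputs of three hypothetical taggers and also works through the error-rate arithmetic from the abstract. The abstract does not specify how PoliTa combines its classifiers, so majority voting is an assumption made here for illustration only; the function names `vote` and `expected_errors` and the sample tag sequences are likewise hypothetical.

```python
from collections import Counter


def vote(tags_per_tagger):
    """Combine tag sequences by per-token majority vote.

    tags_per_tagger: list of equal-length tag lists, one per tagger.
    Ties are resolved in favour of the tagger listed first.
    NOTE: plain voting is an assumption, not the method from the paper.
    """
    combined = []
    for token_tags in zip(*tags_per_tagger):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag (in tagger order) that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


def expected_errors(corpus_tokens, error_rate):
    """Number of mis-tagged tokens implied by a given error rate."""
    return round(corpus_tokens * error_rate)


# Hypothetical outputs of three taggers for a four-token sentence.
tagger_a = ["subst", "fin", "adj", "subst"]
tagger_b = ["subst", "fin", "subst", "subst"]
tagger_c = ["adj", "fin", "adj", "prep"]

combined = vote([tagger_a, tagger_b, tagger_c])
print(combined)

# A 1% error rate on a billion-token corpus.
print(expected_errors(1_000_000_000, 0.01))
```

In this toy input each tagger disagrees with the others on at least one token, yet every token still has a tag backed by two of the three taggers, which is the intuition behind expecting a combination to outperform its individual members.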
Anthology ID:
L14-1014
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2949–2954
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1018_Paper.pdf
Cite (ACL):
Łukasz Kobyliński. 2014. PoliTa: A multitagger for Polish. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2949–2954, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
PoliTa: A multitagger for Polish (Kobyliński, LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/1018_Paper.pdf