Can Stanza be Used for Part-of-Speech Tagging Historical Polish?

Maria Irena Szawerna


Abstract
This paper evaluates the performance of Stanza, a part-of-speech (POS) tagger developed for modern Polish, on historical text, in order to assess whether it can be used to automate the annotation of other historical texts. While the reliability of POS taggers on historical data has been discussed before, most of that research focuses on languages whose grammar differs from that of Polish, so its findings need not fully apply in this case. Stanza is evaluated on two sets of 10,286 and 3,270 manually annotated tokens from a piece of historical Polish writing (1899), and the errors are analyzed both qualitatively and quantitatively. The results show that the tagger performs well, especially on Universal Part-of-Speech (UPOS) tags, which is promising for using it for automatic annotation in larger projects. They also pinpoint some common features of misclassified tokens.
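The quantitative side of such an evaluation can be illustrated with a minimal sketch, assuming the tagger's output has already been aligned token for token with the manual annotation. The function name `evaluate_tags` and the toy tag sequences are hypothetical and are not taken from the paper; the sketch only shows how token-level accuracy and the most frequent confusions could be computed.

```python
from collections import Counter


def evaluate_tags(gold, predicted):
    """Return token-level accuracy and a Counter of (gold, predicted) errors."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token for token")
    if not gold:
        raise ValueError("no tokens to evaluate")
    # Count every mismatch as a (gold tag, predicted tag) pair.
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    correct = len(gold) - sum(errors.values())
    return correct / len(gold), errors


# Hypothetical toy data (UPOS tags), not results from the paper.
gold = ["NOUN", "VERB", "ADJ", "NOUN", "ADP"]
predicted = ["NOUN", "VERB", "NOUN", "NOUN", "ADP"]

accuracy, errors = evaluate_tags(gold, predicted)
print(f"accuracy = {accuracy:.2f}")
for (g, p), count in errors.most_common():
    print(f"gold {g} tagged as {p}: {count}")
```

Listing the confusion pairs by frequency is one simple way to surface recurring error types, which could then be inspected qualitatively.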
Anthology ID:
2024.eacl-srw.4
Volume:
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Month:
March
Year:
2024
Address:
St. Julian’s, Malta
Editors:
Neele Falk, Sara Papi, Mike Zhang
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
44–49
URL:
https://aclanthology.org/2024.eacl-srw.4
Cite (ACL):
Maria Irena Szawerna. 2024. Can Stanza be Used for Part-of-Speech Tagging Historical Polish? In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 44–49, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal):
Can Stanza be Used for Part-of-Speech Tagging Historical Polish? (Szawerna, EACL 2024)
PDF:
https://aclanthology.org/2024.eacl-srw.4.pdf