Tagger for Polish Computer Mediated Communication Texts

Wiktor Walentynowicz, Maciej Piasecki, Marcin Oleksy


Abstract
In this paper we present a morpho-syntactic tagger dedicated to Computer-mediated Communication (CMC) texts in Polish. Its construction is based on an expanded RNN-based neural network adapted to work on noisy texts. Among several techniques, the tagger utilises fastText embedding vectors, sequential character embedding vectors, and Brown clustering for the coarse-grained representation of sentence structures. In addition, a set of manually written rules was proposed for post-processing. The system was trained to disambiguate descriptions of words with respect to part-of-speech tags together with the full morphological information, expressed as values of the different grammatical categories. We also present an evaluation of several model variants on gold-standard annotated CMC data, a comparison with state-of-the-art taggers for Polish, and an error analysis. The proposed tagger shows significantly better results in this domain and demonstrates the viability of adaptation.
Anthology ID:
R19-1148
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
1295–1303
URL:
https://aclanthology.org/R19-1148
DOI:
10.26615/978-954-452-056-4_148
Cite (ACL):
Wiktor Walentynowicz, Maciej Piasecki, and Marcin Oleksy. 2019. Tagger for Polish Computer Mediated Communication Texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 1295–1303, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Tagger for Polish Computer Mediated Communication Texts (Walentynowicz et al., RANLP 2019)
PDF:
https://aclanthology.org/R19-1148.pdf