Lexical Correction of Polish Twitter Political Data

Maciej Ogrodniczuk, Mateusz Kopeć


Abstract
Language processing architectures are often evaluated in near-perfect conditions with respect to the processed content. Tools that perform sufficiently well on electronic press, books and other types of non-interactive content may handle poorly the littered, colloquial and multilingual textual data that make up the majority of communication today. This paper investigates how Polish Twitter data (in a slightly controlled ‘political’ flavour) differ from what linguistic tools expect and how the data could be corrected so that they are ready for processing by the standard language processing chains available for Polish. The setting includes specialised components for spelling correction of tweets as well as for hashtag and username decoding.
Anthology ID: W17-2215
Volume: Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month: August
Year: 2017
Address: Vancouver, Canada
Venue: LaTeCH
SIG: SIGHUM
Publisher: Association for Computational Linguistics
Pages: 115–125
URL: https://aclanthology.org/W17-2215
DOI: 10.18653/v1/W17-2215
Cite (ACL): Maciej Ogrodniczuk and Mateusz Kopeć. 2017. Lexical Correction of Polish Twitter Political Data. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 115–125, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal): Lexical Correction of Polish Twitter Political Data (Ogrodniczuk & Kopeć, LaTeCH 2017)
PDF: https://aclanthology.org/W17-2215.pdf