Creation of Polish Online News Corpus for Political Polarization Studies

Joanna Szwoch, Mateusz Staszkow, Rafal Rzepka, Kenji Araki


Abstract
In this paper we describe a Polish news corpus, an attempt to create a filtered, organized and representative set of texts drawn from contemporary online press articles published by two major Polish TV news providers: the commercial TVN24 and the state-owned TVP Info. The process consists of web scraping, data cleaning and formatting. A random sample was selected from the prepared data to perform a classification task. The random forest achieved the best prediction results of all the models considered. We believe that this dataset is a valuable contribution to existing Polish language corpora, as online news is considered to be formal and relatively mistake-free, and therefore a reliable source of correct written language, unlike other online platforms such as blogs or social media. Furthermore, to our knowledge, such a corpus covering this period of time has not been created before. In the future we would like to expand this dataset with articles from other online news providers and repeat the classification task on a bigger scale using other algorithms. The outcomes of our data analysis might be a relevant basis for improving research on political polarization and propaganda techniques in media.
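The abstract names the cleaning, formatting and random-sampling steps without giving details. The following is only a minimal sketch of what such steps could look like, not the authors' implementation: the function names `clean_article` and `sample_articles`, the record layout with `source` and `text` keys, and the fixed seed are all assumptions made for illustration.

```python
import random
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_article(raw_html):
    """Hypothetical cleaning step: drop markup and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    return re.sub(r"\s+", " ", text).strip()


def sample_articles(records, k, seed=0):
    """Hypothetical sampling step: reproducible sample without replacement."""
    rng = random.Random(seed)  # assumed seed, only to make the sample repeatable
    return rng.sample(records, min(k, len(records)))


# Assumed record format: one dict per article, labelled with its provider.
raw = "<p>Rząd  ogłosił&nbsp;plan.</p><script>var x = 1;</script>"
record = {"source": "TVN24", "text": clean_article(raw)}
```

A sample drawn this way could then be passed to whichever classifier is being compared; fixing the seed keeps the comparison between models on the same articles.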
Anthology ID:
2022.politicalnlp-1.12
Volume:
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Haithem Afli, Mehwish Alam, Houda Bouamor, Cristina Blasi Casagran, Colleen Boland, Sahar Ghannay
Venue:
PoliticalNLP
Publisher:
European Language Resources Association
Pages:
86–90
URL:
https://aclanthology.org/2022.politicalnlp-1.12
Cite (ACL):
Joanna Szwoch, Mateusz Staszkow, Rafal Rzepka, and Kenji Araki. 2022. Creation of Polish Online News Corpus for Political Polarization Studies. In Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences, pages 86–90, Marseille, France. European Language Resources Association.
Cite (Informal):
Creation of Polish Online News Corpus for Political Polarization Studies (Szwoch et al., PoliticalNLP 2022)
PDF:
https://aclanthology.org/2022.politicalnlp-1.12.pdf