Datasets of Slovene and Croatian Moderated News Comments

Nikola Ljubešić, Tomaž Erjavec, Darja Fišer


Abstract
This paper presents two large, newly constructed datasets of moderated news comments from two highly popular online news portals in their respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by manually annotating the types of content deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, so that others can run experiments on detecting content to be deleted without potentially inappropriate content being revealed. Finally, the baseline classification models trained on the non-encrypted datasets are released as well to enable real-world use.
Anthology ID:
W18-5116
Volume:
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
Venue:
ALW
Publisher:
Association for Computational Linguistics
Pages:
124–131
URL:
https://aclanthology.org/W18-5116/
DOI:
10.18653/v1/W18-5116
Cite (ACL):
Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2018. Datasets of Slovene and Croatian Moderated News Comments. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 124–131, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Datasets of Slovene and Croatian Moderated News Comments (Ljubešić et al., ALW 2018)
PDF:
https://aclanthology.org/W18-5116.pdf