Merging Datasets for Aggressive Text Identification

Paula Fortuna; José Ferreira; Luiz Pires; Guilherme Routar; Sérgio Nunes

Abstract

This paper presents the approach of the team “groutar” to the shared task on Aggression Identification, considering the English test sets from both Facebook and general Social Media. The experiment tests the effect of merging new datasets on the performance of classification models. We followed a standard machine learning approach with training, validation, and testing phases, and considered features such as part-of-speech, frequencies of insults, punctuation, sentiment, and capitalization. In terms of algorithms, we experimented with Boosted Logistic Regression, Multi-Layer Perceptron, Parallel Random Forest, and eXtreme Gradient Boosting. One question that arose was how to merge datasets that use different classification systems (e.g., aggression vs. toxicity). Another issue concerns the possibility of generalizing models and applying them to data from different social networks. To address these questions, we merged two datasets, and the results showed that training with similar data is an advantage in the classification of social network data. However, adding data from different platforms allowed slightly better results on both Facebook and Social Media, indicating that more generalized models can be an advantage.
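The question of merging datasets that use different classification systems can be illustrated with a minimal sketch. It assumes that both label schemes are projected onto one shared binary label before the data are concatenated. The label names, the mapping tables, and the function names `harmonize` and `merge_datasets` are hypothetical; the abstract does not state how the two schemes were actually aligned.

```python
from typing import Dict, List, Tuple

# Hypothetical mappings onto a shared binary label (1 = abusive, 0 = not).
# These are assumptions for illustration, not the mapping used in the paper.
AGGRESSION_TO_SHARED: Dict[str, int] = {"aggressive": 1, "non_aggressive": 0}
TOXICITY_TO_SHARED: Dict[str, int] = {"toxic": 1, "non_toxic": 0}

Example = Tuple[str, str]        # (text, original label)
Merged = Tuple[str, int, str]    # (text, shared label, source dataset)


def harmonize(examples: List[Example], mapping: Dict[str, int], source: str) -> List[Merged]:
    """Project one dataset's labels onto the shared scheme and tag each row with its origin."""
    merged: List[Merged] = []
    for text, label in examples:
        if label not in mapping:
            # Fail loudly rather than silently dropping or guessing unknown labels.
            raise ValueError(f"unmapped label {label!r} in {source}")
        merged.append((text, mapping[label], source))
    return merged


def merge_datasets(aggression: List[Example], toxicity: List[Example]) -> List[Merged]:
    """Concatenate two datasets after mapping both onto the shared label."""
    return (
        harmonize(aggression, AGGRESSION_TO_SHARED, "aggression")
        + harmonize(toxicity, TOXICITY_TO_SHARED, "toxicity")
    )


if __name__ == "__main__":
    # Toy rows invented for this sketch only.
    agg = [("you are an idiot", "aggressive"), ("nice photo", "non_aggressive")]
    tox = [("shut up, moron", "toxic")]
    for row in merge_datasets(agg, tox):
        print(row)
```

Keeping the source dataset on each row makes it possible to compare a model trained on one platform only against a model trained on the merged data, which is the comparison the abstract describes.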

Anthology ID: W18-4416
Volume: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, Shervin Malmasi
Venue: TRAC
Publisher: Association for Computational Linguistics
Pages: 128–139
URL: https://aclanthology.org/W18-4416/
Cite (ACL): Paula Fortuna, José Ferreira, Luiz Pires, Guilherme Routar, and Sérgio Nunes. 2018. Merging Datasets for Aggressive Text Identification. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pages 128–139, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Merging Datasets for Aggressive Text Identification (Fortuna et al., TRAC 2018)
PDF: https://aclanthology.org/W18-4416.pdf
