Offensive Video Detection: Dataset and Baseline Results

Cleber Alcântara, Viviane Moreira, Diego Feijo


Abstract
Web users produce and publish high volumes of data of various types, such as text, images, and videos. Platforms try to restrain their users from publishing offensive content in order to keep a friendly and respectful environment, and they rely on moderators to filter the posts. However, manual moderation is insufficient given the high volume of publications. Offensive material can be identified automatically using machine learning, which requires annotated datasets. Among the published datasets on this topic, the Portuguese language is underrepresented, and videos are little explored. We investigated the problem of offensive video detection by assembling and publishing a dataset of videos in Portuguese containing mostly textual features. We ran experiments with machine learning classifiers that are popular in this domain and report our findings alongside multiple evaluation metrics. We found that word embeddings combined with Deep Learning classifiers achieved the best results on average. CNN architectures, Naive Bayes, and Random Forest ranked at the top across different experiments. Transfer Learning models outperformed Classic algorithms when processing video transcriptions, but scored lower with other feature sets. These findings can serve as a baseline for future work on this subject.
Anthology ID:
2020.lrec-1.531
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4309–4319
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.531
Cite (ACL):
Cleber Alcântara, Viviane Moreira, and Diego Feijo. 2020. Offensive Video Detection: Dataset and Baseline Results. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4309–4319, Marseille, France. European Language Resources Association.
Cite (Informal):
Offensive Video Detection: Dataset and Baseline Results (Alcântara et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.531.pdf