Abusive language in Spanish children and young teenager’s conversations: data preparation and short text classification with contextual word embeddings

Marta R. Costa-jussà; Esther González; Asunción Moreno; Eudald Cumalat

Abstract

Abusive texts are attracting the interest of the scientific and social communities. How to detect them automatically is a question that is gaining interest in the natural language processing community. The main contribution of this paper is to evaluate the quality of the recently developed "Spanish Database for cyberbullying prevention" for the purpose of training classifiers to detect abusive short texts. We compare classical machine learning techniques with a more advanced model, contextual word embeddings, in the particular case of classifying abusive short texts in Spanish. As contextual word embeddings, we use Bidirectional Encoder Representations from Transformers (BERT), proposed at the end of 2018. We show that BERT mostly outperforms classical techniques. Beyond the experimental impact of our research, this project aims to plant the seeds for an innovative technological tool with high potential social impact and to be part of the initiatives in artificial intelligence for social good.

Anthology ID: 2020.lrec-1.191
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 1533–1537
Language: English
URL: https://aclanthology.org/2020.lrec-1.191/
Cite (ACL): Marta R. Costa-jussà, Esther González, Asuncion Moreno, and Eudald Cumalat. 2020. Abusive language in Spanish children and young teenager’s conversations: data preparation and short text classification with contextual word embeddings. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1533–1537, Marseille, France. European Language Resources Association.
Cite (Informal): Abusive language in Spanish children and young teenager’s conversations: data preparation and short text classification with contextual word embeddings (Costa-jussà et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.191.pdf
