Automatic Detection of Offensive Language in Social Media: Defining Linguistic Criteria to build a Mexican Spanish Dataset

María José Díaz-Torres, Paulina Alejandra Morán-Méndez, Luis Villasenor-Pineda, Manuel Montes-y-Gómez, Juan Aguilera, Luis Meneses-Lerín


Abstract
Phenomena such as bullying, homophobia, sexism and racism have transcended to social networks, motivating the development of tools for their automatic detection. The challenge becomes greater for languages rich in popular sayings, colloquial expressions and idioms which may contain vulgar, profane or rude words, but not always have the intention of offending, as is the case of Mexican Spanish. Under these circumstances, the identification of the offense goes beyond the lexical and syntactic elements of the message. This first work aims to define the main linguistic features of aggressive, offensive and vulgar language in social networks in order to establish linguistic-based criteria to facilitate the identification of abusive language. For this purpose, a Mexican Spanish Twitter corpus was compiled and analyzed. The dataset included words that, despite being rude, need to be considered in context to determine they are part of an offense. Based on the analysis of this corpus, linguistic criteria were defined to determine whether a message is offensive. To simplify the application of these criteria, an easy-to-follow diagram was designed. The paper presents an example of the use of the diagram, as well as the basic statistics of the corpus.
Anthology ID:
2020.trac-1.21
Volume:
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
LREC | TRAC | WS
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
132–136
Language:
English
URL:
https://aclanthology.org/2020.trac-1.21
DOI:
Bibkey:
Cite (ACL):
María José Díaz-Torres, Paulina Alejandra Morán-Méndez, Luis Villasenor-Pineda, Manuel Montes-y-Gómez, Juan Aguilera, and Luis Meneses-Lerín. 2020. Automatic Detection of Offensive Language in Social Media: Defining Linguistic Criteria to build a Mexican Spanish Dataset. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pages 132–136, Marseille, France. European Language Resources Association (ELRA).
Cite (Informal):
Automatic Detection of Offensive Language in Social Media: Defining Linguistic Criteria to build a Mexican Spanish Dataset (Díaz-Torres et al., TRAC 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.trac-1.21.pdf