2022
Contextual Unsupervised Clustering of Signs for Ancient Writing Systems
Michele Corazza, Fabio Tamburini, Miguel Valério, Silvia Ferrara
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
The application of machine learning techniques to ancient writing systems is a relatively new idea, and it poses interesting challenges for researchers. One particularly challenging aspect is the scarcity of data for these scripts, which contrasts with the large amounts of data usually available when applying neural models to computational linguistics and other fields. For this reason, any method that attempts to work on ancient scripts needs to be ad hoc and consider paleographic aspects in addition to computational ones. Considering the peculiar characteristics of the script that we used is therefore a crucial part of our work, as any solution needs to consider the particular nature of the writing system that it is applied to. In this work we propose a preliminary evaluation of a novel unsupervised clustering method on the Cypro-Greek syllabary, a writing system from Cyprus. This evaluation shows that our method improves clustering performance using information about the attested sequences of signs in combination with an unsupervised model for images, with the future goal of applying the methodology to undeciphered writing systems from a related and typologically similar script.
2020
Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, Serena Villata
Findings of the Association for Computational Linguistics: EMNLP 2020
Recent studies have demonstrated the effectiveness of cross-lingual language model pre-training on different NLP tasks, such as natural language inference and machine translation. In our work, we test this approach on social media data, which are particularly challenging to process within this framework, since the limited length of the textual messages and the irregularity of the language make it harder to learn meaningful encodings. More specifically, we propose a hybrid emoji-based Masked Language Model (MLM) to leverage the common information conveyed by emojis across different languages and improve the learned cross-lingual representation of short text messages, with the goal of performing zero-shot abusive language detection. We compare the results obtained with the original MLM to the ones obtained by our method, showing improved performance on German, Italian and Spanish.
2019
A System to Monitor Cyberbullying based on Message Classification and Social Network Analysis
Stefano Menini, Giovanni Moretti, Michele Corazza, Elena Cabrio, Sara Tonelli, Serena Villata
Proceedings of the Third Workshop on Abusive Language Online
Social media platforms like Twitter and Instagram face a surge in cyberbullying phenomena against young users and need to develop scalable computational methods to limit the negative consequences of this kind of abuse. Despite the number of approaches recently proposed in the Natural Language Processing (NLP) research area for detecting different forms of abusive language, the issue of identifying cyberbullying phenomena at scale is still an unsolved problem. This is because of the need to couple abusive language detection on textual messages with network analysis, so that repeated attacks against the same person can be identified. In this paper, we present a system to monitor cyberbullying phenomena by combining message classification and social network analysis. We evaluate the classification module on a data set built on Instagram messages, and we describe the cyberbullying monitoring user interface.
Comparing Automated Methods to Detect Explicit Content in Song Lyrics
Michael Fell, Elena Cabrio, Michele Corazza, Fabien Gandon
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
The Parental Advisory Label (PAL) is a warning label that is placed on audio recordings in recognition of profanity or inappropriate references, with the intention of alerting parents to material potentially unsuitable for children. Since 2015, digital providers – such as iTunes, Spotify, Amazon Music and Deezer – also follow PAL guidelines and tag such tracks as “explicit”. Nowadays, such labelling is carried out mainly manually on a voluntary basis, with the drawbacks of being time-consuming and therefore costly, error-prone and partly subjective. In this paper, we compare automated methods ranging from dictionary-based lookup to state-of-the-art deep neural networks to automatically detect explicit content in English lyrics. We show that more complex models perform only slightly better on this task, and relying on a qualitative analysis of the data, we discuss the inherent hardness and subjectivity of the task.