Michele Corazza

2024

Topic Similarity of Heterogeneous Legal Sources Supporting the Legislative Process
Michele Corazza | Leonardo Zilli | Monica Palmirani
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

The legislative process starts with a deep analysis of the existing regulations at European and national levels, in order to avoid conflicts and to support the norms already in force. Constitutional Court decisions also play a fundamental role in this analysis, both for checking compliance with the constitutional framework and for including the inputs coming from this court in the law-making process. Finally, it is also important to compare the forthcoming proposal with bills already presented on the same topic. This comparison is crucial to avoid overlaps and to coordinate the democratic dialogue among the different parties. In this light, this paper presents an unsupervised approach for calculating similarity between heterogeneous documents annotated in Akoma Ntoso XML, with the aim of supporting the retrieval of similar documents using a thematic taxonomy used in the legal domain. The prototype was developed in response to a call for expressions of interest launched by the Italian Chamber of Deputies in order to adopt hybrid AI in the legislative process. It uses a completely unsupervised approach based on Sentence Transformers, meaning that neither annotated data nor any fine-tuning process is required.

2022


Contextual Unsupervised Clustering of Signs for Ancient Writing Systems
Michele Corazza | Fabio Tamburini | Miguel Valério | Silvia Ferrara
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

The application of machine learning techniques to ancient writing systems is a relatively new idea, and it poses interesting challenges for researchers. One particularly challenging aspect is the scarcity of data for these scripts, which contrasts with the large amounts of data usually available when applying neural models to computational linguistics and other fields. For this reason, any method that attempts to work on ancient scripts needs to be ad hoc and consider paleographic aspects, in addition to computational ones. Considering the peculiar characteristics of the script that we used is therefore a crucial part of our work, as any solution needs to consider the particular nature of the writing system that it is applied to. In this work we propose a preliminary evaluation of a novel unsupervised clustering method on the Cypro-Greek syllabary, a writing system from Cyprus. This evaluation shows that our method improves clustering performance using information about the attested sequences of signs in combination with an unsupervised model for images, with the future goal of applying the methodology to undeciphered writing systems from a related and typologically similar script.

2020


Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection
Michele Corazza | Stefano Menini | Elena Cabrio | Sara Tonelli | Serena Villata
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent studies have demonstrated the effectiveness of cross-lingual language model pre-training on different NLP tasks, such as natural language inference and machine translation. In our work, we test this approach on social media data, which are particularly challenging to process within this framework, since the limited length of the textual messages and the irregularity of the language make it harder to learn meaningful encodings. More specifically, we propose a hybrid emoji-based Masked Language Model (MLM) to leverage the common information conveyed by emojis across different languages and improve the learned cross-lingual representation of short text messages, with the goal of performing zero-shot abusive language detection. We compare the results obtained with the original MLM to the ones obtained by our method, showing improved performance on German, Italian and Spanish.

2019


Cross-Platform Evaluation for Italian Hate Speech Detection
Michele Corazza | Stefano Menini | Elena Cabrio | Sara Tonelli | Serena Villata
Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019)


Comparing Automated Methods to Detect Explicit Content in Song Lyrics
Michael Fell | Elena Cabrio | Michele Corazza | Fabien Gandon
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The Parental Advisory Label (PAL) is a warning label that is placed on audio recordings in recognition of profanity or inappropriate references, with the intention of alerting parents to material potentially unsuitable for children. Since 2015, digital providers – such as iTunes, Spotify, Amazon Music and Deezer – also follow PAL guidelines and tag such tracks as “explicit”. Nowadays, such labelling is carried out mainly manually on a voluntary basis, with the drawbacks of being time-consuming and therefore costly, error-prone and partly subjective. In this paper, we compare automated methods ranging from dictionary-based lookup to state-of-the-art deep neural networks to automatically detect explicit content in English lyrics. We show that more complex models perform only slightly better on this task, and relying on a qualitative analysis of the data, we discuss the inherent hardness and subjectivity of the task.

pdf bib abs

Social media platforms like Twitter and Instagram face a surge in cyberbullying phenomena against young users and need to develop scalable computational methods to limit the negative consequences of this kind of abuse. Despite the number of approaches recently proposed in the Natural Language Processing (NLP) research area for detecting different forms of abusive language, identifying cyberbullying phenomena at scale is still an unsolved problem. This is because abusive language detection on textual messages needs to be coupled with network analysis, so that repeated attacks against the same person can be identified. In this paper, we present a system to monitor cyberbullying phenomena by combining message classification and social network analysis. We evaluate the classification module on a data set built on Instagram messages, and we describe the cyberbullying monitoring user interface.
