Serena Villata


pdf bib
CyberAgressionAdo-v1: a Dataset of Annotated Online Aggressions in French Collected through a Role-playing Game
Anaïs Ollagnier | Elena Cabrio | Serena Villata | Catherine Blaya
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decades, the number of episodes of cyber aggression occurring online has grown substantially, especially among teens. Most solutions investigated by the NLP community to curb such online abusive behaviors consist of supervised approaches relying on annotated data extracted from social media. However, recent studies have highlighted that private instant messaging platforms are major mediums of cyber aggression among teens. As such interactions remain invisible due to the app privacy policies, very few datasets collecting aggressive conversations are available for the computational analysis of language. In order to overcome this limitation, in this paper we present the CyberAgressionAdo-V1 dataset, containing aggressive multiparty chats in French collected through a role-playing game in high-schools, and annotated at different layers. We describe the data collection and annotation phases, carried out in the context of a EU and a national research projects, and provide insightful analysis on the different types of aggression and verbal abuse depending on the targeted victims (individuals or communities) emerging from the collected data.


pdf bib
“Don’t discuss”: Investigating Semantic and Argumentative Features for Supervised Propagandist Message Detection and Classification
Vorakit Vorakitphan | Elena Cabrio | Serena Villata
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

One of the mechanisms through which disinformation is spreading online, in particular through social media, is by employing propaganda techniques. These include specific rhetorical and psychological strategies, ranging from leveraging on emotions to exploiting logical fallacies. In this paper, our goal is to push forward research on propaganda detection based on text analysis, given the crucial role these methods may play to address this main societal issue. More precisely, we propose a supervised approach to classify textual snippets both as propaganda messages and according to the precise applied propaganda technique, as well as a detailed linguistic analysis of the features characterising propaganda information in text (e.g., semantic, sentiment and argumentation features). Extensive experiments conducted on two available propagandist resources (i.e., NLP4IF’19 and SemEval’20-Task 11 datasets) show that the proposed approach, leveraging different language models and the investigated linguistic features, achieves very promising results on propaganda classification, both at sentence- and at fragment-level.

pdf bib
Sifting French Tweets to Investigate the Impact of Covid-19 in Triggering Intense Anxiety
Mohamed Amine Romdhane | Elena Cabrio | Serena Villata
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Sifting French Tweets to Investigate the Impact of Covid-19 in Triggering Intense Anxiety. Social media can be leveraged to understand public sentiment and feelings in real-time, and target public health messages based on user interests and emotions. In this paper, we investigate the impact of the COVID-19 pandemic in triggering intense anxiety, relying on messages exchanged on Twitter. More specifically, we provide : i) a quantitative and qualitative analysis of a corpus of tweets in French related to coronavirus, and ii) a pipeline approach (a filtering mechanism followed by Neural Network methods) to satisfactory classify messages expressing intense anxiety on social media, considering the role played by emotions.

pdf bib
Extraction d’arguments basée sur les transformateurs pour des applications dans le domaine de la santé (Transformer-based Argument Mining for Healthcare Applications)
Tobias Mayer | Elena Cabrio | Serena Villata
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Nous présentons des résumés en français et en anglais de l’article (Mayer et al., 2020) présenté à la conférence 24th European Conference on Artificial Intelligence (ECAI-2020) en 2020.


pdf bib
Regrexit or not Regrexit: Aspect-based Sentiment Analysis in Polarized Contexts
Vorakit Vorakitphan | Marco Guerini | Elena Cabrio | Serena Villata
Proceedings of the 28th International Conference on Computational Linguistics

Emotion analysis in polarized contexts represents a challenge for Natural Language Processing modeling. As a step in the aforementioned direction, we present a methodology to extend the task of Aspect-based Sentiment Analysis (ABSA) toward the affect and emotion representation in polarized settings. In particular, we adopt the three-dimensional model of affect based on Valence, Arousal, and Dominance (VAD). We then present a Brexit scenario that proves how affect varies toward the same aspect when politically polarized stances are presented. Our approach captures aspect-based polarization from newspapers regarding the Brexit scenario of 1.2m entities at sentence-level. We demonstrate how basic constituents of emotions can be mapped to the VAD model, along with their interactions respecting the polarized context in ABSA settings using biased key-concepts (e.g., “stop Brexit” vs. “support Brexit”). Quite intriguingly, the framework achieves to produce coherent aspect evidences of Brexit’s stance from key-concepts, showing that VAD influence the support and opposition aspects.

pdf bib
Proceedings of the 7th Workshop on Argument Mining
Elena Cabrio | Serena Villata
Proceedings of the 7th Workshop on Argument Mining

pdf bib
Hybrid Emoji-Based Masked Language Models for Zero-Shot Abusive Language Detection
Michele Corazza | Stefano Menini | Elena Cabrio | Sara Tonelli | Serena Villata
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent studies have demonstrated the effectiveness of cross-lingual language model pre-training on different NLP tasks, such as natural language inference and machine translation. In our work, we test this approach on social media data, which are particularly challenging to process within this framework, since the limited length of the textual messages and the irregularity of the language make it harder to learn meaningful encodings. More specifically, we propose a hybrid emoji-based Masked Language Model (MLM) to leverage the common information conveyed by emojis across different languages and improve the learned cross-lingual representation of short text messages, with the goal to perform zero- shot abusive language detection. We compare the results obtained with the original MLM to the ones obtained by our method, showing improved performance on German, Italian and Spanish.


pdf bib
Yes, we can! Mining Arguments in 50 Years of US Presidential Campaign Debates
Shohreh Haddadan | Elena Cabrio | Serena Villata
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Political debates offer a rare opportunity for citizens to compare the candidates’ positions on the most controversial topics of the campaign. Thus they represent a natural application scenario for Argument Mining. As existing research lacks solid empirical investigation of the typology of argument components in political debates, we fill this gap by proposing an Argument Mining approach to political debates. We address this task in an empirical manner by annotating 39 political debates from the last 50 years of US presidential campaigns, creating a new corpus of 29k argument components, labeled as premises and claims. We then propose two tasks: (1) identifying the argumentative components in such debates, and (2) classifying them as premises and claims. We show that feature-rich SVM learners and Neural Network architectures outperform standard baselines in Argument Mining over such complex data. We release the new corpus USElecDeb60To16 and the accompanying software under free licenses to the research community.

pdf bib
A System to Monitor Cyberbullying based on Message Classification and Social Network Analysis
Stefano Menini | Giovanni Moretti | Michele Corazza | Elena Cabrio | Sara Tonelli | Serena Villata
Proceedings of the Third Workshop on Abusive Language Online

Social media platforms like Twitter and Instagram face a surge in cyberbullying phenomena against young users and need to develop scalable computational methods to limit the negative consequences of this kind of abuse. Despite the number of approaches recently proposed in the Natural Language Processing (NLP) research area for detecting different forms of abusive language, the issue of identifying cyberbullying phenomena at scale is still an unsolved problem. This is because of the need to couple abusive language detection on textual message with network analysis, so that repeated attacks against the same person can be identified. In this paper, we present a system to monitor cyberbullying phenomena by combining message classification and social network analysis. We evaluate the classification module on a data set built on Instagram messages, and we describe the cyberbullying monitoring user interface.


pdf bib
Evidence Type Classification in Randomized Controlled Trials
Tobias Mayer | Elena Cabrio | Serena Villata
Proceedings of the 5th Workshop on Argument Mining

Randomized Controlled Trials (RCT) are a common type of experimental studies in the medical domain for evidence-based decision making. The ability to automatically extract the arguments proposed therein can be of valuable support for clinicians and practitioners in their daily evidence-based decision making activities. Given the peculiarity of the medical domain and the required level of detail, standard approaches to argument component detection in argument(ation) mining are not fine-grained enough to support such activities. In this paper, we introduce a new sub-task of the argument component identification task: evidence type classification. To address it, we propose a supervised approach and we test it on a set of RCT abstracts on different medical topics.

pdf bib
Increasing Argument Annotation Reproducibility by Using Inter-annotator Agreement to Improve Guidelines
Milagro Teruel | Cristian Cardellino | Fernando Cardellino | Laura Alonso Alemany | Serena Villata
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


pdf bib
Argument Mining on Twitter: Arguments, Facts and Sources
Mihai Dusmanu | Elena Cabrio | Serena Villata
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Social media collect and spread on the Web personal opinions, facts, fake news and all kind of information users may be interested in. Applying argument mining methods to such heterogeneous data sources is a challenging open research issue, in particular considering the peculiarities of the language used to write textual messages on social media. In addition, new issues emerge when dealing with arguments posted on such platforms, such as the need to make a distinction between personal opinions and actual facts, and to detect the source disseminating information about such facts to allow for provenance verification. In this paper, we apply supervised classification to identify arguments on Twitter, and we present two new tasks for argument mining, namely facts recognition and source identification. We study the feasibility of the approaches proposed to address these tasks on a set of tweets related to the Grexit and Brexit news topics.

pdf bib
Legal NERC with ontologies, Wikipedia and curriculum learning
Cristian Cardellino | Milagro Teruel | Laura Alonso Alemany | Serena Villata
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present a Wikipedia-based approach to develop resources for the legal domain. We establish a mapping between a legal domain ontology, LKIF (Hoekstra et al. 2007), and a Wikipedia-based ontology, YAGO (Suchanek et al. 2007), and through that we populate LKIF. Moreover, we use the mentions of those entities in Wikipedia text to train a specific Named Entity Recognizer and Classifier. We find that this classifier works well in the Wikipedia, but, as could be expected, performance decreases in a corpus of judgments of the European Court of Human Rights. However, this tool will be used as a preprocess for human annotation. We resort to a technique called “curriculum learning” aimed to overcome problems of overfitting by learning increasingly more complex concepts. However, we find that in this particular setting, the method works best by learning from most specific to most general concepts, not the other way round.


pdf bib
DART: a Dataset of Arguments and their Relations on Twitter
Tom Bosc | Elena Cabrio | Serena Villata
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The problem of understanding the stream of messages exchanged on social media such as Facebook and Twitter is becoming a major challenge for automated systems. The tremendous amount of data exchanged on these platforms as well as the specific form of language adopted by social media users constitute a new challenging context for existing argument mining techniques. In this paper, we describe a resource of natural language arguments called DART (Dataset of Arguments and their Relations on Twitter) where the complete argument mining pipeline over Twitter messages is considered: (i) we identify which tweets can be considered as arguments and which cannot, and (ii) we identify what is the relation, i.e., support or attack, linking such tweets to each other.


pdf bib
Classifying Inconsistencies in DBpedia Language Specific Chapters
Elena Cabrio | Serena Villata | Fabien Gandon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper proposes a methodology to identify and classify the semantic relations holding among the possible different answers obtained for a certain query on DBpedia language specific chapters. The goal is to reconcile information provided by language specific DBpedia chapters to obtain a consistent results set. Starting from the identified semantic relations between two pieces of information, we further classify them as positive or negative, and we exploit bipolar abstract argumentation to represent the result set as a unique graph, where using argumentation semantics we are able to detect the (possible multiple) consistent sets of elements of the query result. We experimented with the proposed methodology over a sample of triples extracted from 10 DBpedia ontology properties. We define the LingRel ontology to represent how the extracted information from different chapters is related to each other, and we map the properties of the LingRel ontology to the properties of the SIOC-Argumentation ontology to built argumentation graphs. The result is a pilot resource that can be profitably used both to train and to evaluate NLP applications querying linked data in detecting the semantic relations among the extracted values, in order to output consistent information sets.


pdf bib
Detecting Bipolar Semantic Relations among Natural Language Arguments with Textual Entailment: a Study.
Elena Cabrio | Serena Villata
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora


pdf bib
Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions
Elena Cabrio | Serena Villata
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


pdf bib
Automatic extraction of subcategorization frames for Italian
Dino Ienco | Serena Villata | Cristina Bosco
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Subcategorization is a kind of knowledge which can be considered as crucial in several NLP tasks, such as Information Extraction or parsing, but the collection of very large resources including subcategorization representation is difficult and time-consuming. Various experiences show that the automatic extraction can be a practical and reliable solution for acquiring such a kind of knowledge. The aim of this paper is to investigate the relationships between subcategorization frame extraction and the nature of data from which the frames have to be extracted, e.g. how much the task can be influenced by the richness/poorness of the annotation. Therefore, we present some experiments that apply statistical subcategorization extraction methods, known in literature, on an Italian treebank that exploits a rich set of dependency relations that can be annotated at different degrees of specificity. Benefiting from the availability of relation sets that implement different granularity in the representation of relations, we evaluate our results with reference to previous works in a cross-linguistic perspective.