Marco Antonio Stranisci

Also published as: Marco Stranisci


2024

Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin
Marco Stranisci | Pere-Lluís Huguet Cabot | Elisa Bassignana | Roberto Navigli
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and then (iii) analyzing the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: in the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within data and predictions for RE.
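
Step (iii) of the methodology above, comparing how relation types are distributed across groups, can be illustrated with a minimal sketch. This is not the authors' code: the triplet layout, the `subject_meta` field, the function name `relation_distribution`, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def relation_distribution(triplets, attribute):
    """Return {group: {relation: share}} for one metadata attribute.

    Each triplet is a dict with a 'relation' label and a 'subject_meta'
    dict (a hypothetical layout, assumed to come from entity linking).
    Triplets whose subject lacks the attribute are skipped.
    """
    counts = defaultdict(Counter)
    for triplet in triplets:
        group = triplet["subject_meta"].get(attribute)
        if group is None:
            continue
        counts[group][triplet["relation"]] += 1
    return {
        group: {rel: n / sum(c.values()) for rel, n in c.items()}
        for group, c in counts.items()
    }


# Toy, made-up triplets (not taken from any of the datasets named above).
toy_triplets = [
    {"subject": "P1", "relation": "spouse", "object": "P9",
     "subject_meta": {"gender": "female"}},
    {"subject": "P2", "relation": "occupation", "object": "writer",
     "subject_meta": {"gender": "female"}},
    {"subject": "P3", "relation": "occupation", "object": "engineer",
     "subject_meta": {"gender": "male"}},
    {"subject": "P4", "relation": "occupation", "object": "politician",
     "subject_meta": {"gender": "male"}},
    {"subject": "P5", "relation": "spouse", "object": "P8",
     "subject_meta": {"gender": "male"}},
    {"subject": "P6", "relation": "occupation", "object": "painter",
     "subject_meta": {}},  # no linked gender: excluded from the analysis
]

shares = relation_distribution(toy_triplets, "gender")
for group, dist in sorted(shares.items()):
    print(group, dist)
```

Comparing the per-group shares of each relation type (rather than raw counts) keeps the comparison meaningful when one group has far fewer triplets than another.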

2023

WikiBio: a Semantic Resource for the Intersectional Analysis of Biographical Events
Marco Antonio Stranisci | Rossana Damiano | Enrico Mensa | Viviana Patti | Daniele Radicioni | Tommaso Caselli
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Biographical event detection is a relevant task that allows for the exploration and comparison of the ways in which people’s lives are told and represented. This may support several real-life applications in digital humanities and in works aimed at exploring bias about minoritized groups. Despite that, there are no corpora and models specifically designed for this task. In this paper we fill this gap by presenting a new corpus annotated for biographical event detection. The corpus, which includes 20 Wikipedia biographies, was aligned with 5 existing corpora in order to train a model for the biographical event detection task. The model was able to detect all mentions of the target-entity in a biography with an F-score of 0.808 and the entity-related events with an F-score of 0.859. Finally, the model was used for performing an analysis of biases about women and non-Western people in Wikipedia biographies.

2022

Guidelines and a Corpus for Extracting Biographical Events
Marco Antonio Stranisci | Enrico Mensa | Rossana Damiano | Daniele Radicioni | Ousmane Diakite
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022

Although biographies are widespread within the Semantic Web, resources and approaches to automatically extract biographical events are limited. Such limitation reduces the amount of structured, machine-readable biographical information, especially about people belonging to underrepresented groups. Our work challenges this limitation by providing a set of guidelines for the semantic annotation of life events. The guidelines are designed to be interoperable with existing ISO standards for semantic annotation: ISO-TimeML (ISO-24617-1), and SemAF (ISO-24617-4). Guidelines were tested through an annotation task of Wikipedia biographies of underrepresented writers, namely authors born in non-Western countries, migrants, or belonging to ethnic minorities. 1,000 sentences were annotated by 4 annotators with an average Inter-Annotator Agreement of 0.825. The resulting corpus was mapped onto OntoNotes. This mapping allowed us to expand our corpus, showing that already existing resources may be exploited for the biographical event extraction task.

APPReddit: a Corpus of Reddit Posts Annotated for Appraisal
Marco Antonio Stranisci | Simona Frenda | Eleonora Ceccaldi | Valerio Basile | Rossana Damiano | Viviana Patti
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Despite the large number of computational resources for emotion recognition, there is a lack of data sets relying on appraisal models. According to Appraisal theories, emotions are the outcome of a multi-dimensional evaluation of events. In this paper, we present APPReddit, the first corpus of non-experimental data annotated according to this theory. After describing its development, we compare our resource with enISEAR, a corpus of events created in an experimental setting and annotated for appraisal. Results show that the two corpora can be mapped notwithstanding different typologies of data and annotation schemes. An SVM model trained on APPReddit predicts four appraisal dimensions without significant loss. Merging both corpora in a single training set improves the prediction of 3 out of 4 dimensions. Such findings pave the way for a better performing classification model for appraisal prediction.

O-Dang! The Ontology of Dangerous Speech Messages
Marco Antonio Stranisci | Simona Frenda | Mirko Lai | Oscar Araque | Alessandra Teresa Cignarella | Valerio Basile | Cristina Bosco | Viviana Patti
Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data

Within the NLP community, a considerable number of language resources are created, annotated and released every day with the aim of studying specific linguistic phenomena. Although a variety of attempts to organize such resources have been carried out, a lack of systematic methods and of interoperability between resources is still present. Furthermore, when storing linguistic information, the most common practice is still to rely on the concept of “gold standard”, which is in contrast with recent trends in NLP that aim at stressing the importance of different subjectivities and points of view when training machine learning and deep learning methods. In this paper we present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG) for the collection of linguistically annotated data. O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community. The ontology has also been designed to account for a perspectivist approach, since it provides a model for encoding both gold standard and single-annotator labels in the KG. The paper is structured as follows. In Section 1 the motivations of our work are outlined. Section 2 describes the O-Dang! Ontology, which provides a common semantic model for the integration of datasets in the KG. The Ontology Population stage with information about corpora, users, and annotations is presented in Section 3. Finally, in Section 4 an analysis of offensiveness across corpora is provided as a first case study for the resource.

2018

An Italian Twitter Corpus of Hate Speech against Immigrants
Manuela Sanguinetti | Fabio Poletto | Cristina Bosco | Viviana Patti | Marco Stranisci
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Annotating Sentiment and Irony in the Online Italian Political Debate on #labuonascuola
Marco Stranisci | Cristina Bosco | Delia Irazú Hernández Farías | Viviana Patti
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present the TWitterBuonaScuola corpus (TW-BS), a novel Italian linguistic resource for Sentiment Analysis, developed with the main aim of analyzing the online debate on the controversial Italian political reform “Buona Scuola” (Good school), aimed at reorganizing the national educational and training systems. We describe the methodologies applied in the collection and annotation of data. The collection has been driven by the detection of the hashtags mainly used by the participants in the debate, while the annotation has been focused on sentiment polarity and irony, but also extended to mark the aspects of the reform that were mainly discussed in the debate. An in-depth study of the disagreement among annotators is included. We describe the collection and annotation stages, and the in-depth analysis of disagreement made with Crowdflower, a crowdsourcing annotation platform.