Hugo Sousa


2024

Text2Story Lusa: A Dataset for Narrative Analysis in European Portuguese News Articles
Sérgio Nunes | Alípio Mario Jorge | Evelin Amorim | Hugo Sousa | António Leal | Purificação Moura Silvano | Inês Cantante | Ricardo Campos
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Narratives have been the subject of extensive research across various scientific fields such as linguistics and computer science. However, the scarcity of freely available datasets, essential for studying this genre, remains a significant obstacle. Furthermore, datasets annotated with narrative components and their morphosyntactic and semantic information are even scarcer. To address this gap, we developed the Text2Story Lusa datasets, which consist of a collection of news articles in European Portuguese. The first dataset consists of 357 news articles, and the second dataset comprises a subset of 117 manually and densely annotated articles, totaling over 50 thousand individual annotations. By focusing on texts with substantial narrative elements, we aim to provide a valuable resource for studying narrative structures in European Portuguese news articles. On the one hand, the first dataset provides researchers with data to study narratives from various perspectives. On the other hand, the annotated dataset facilitates research in information extraction and related tasks, particularly in the context of narrative extraction pipelines. Both datasets are made available adhering to FAIR principles, thereby enhancing their utility within the research community.