Roser Morante

2025

In this article we present UNED-ACCESS 2024, a bilingual dataset of 1003 multiple-choice questions from university entrance-level exams in Spanish and English. The questions were originally formulated in Spanish and manually translated into English, and have never been publicly released, which ensures minimal contamination when evaluating Large Language Models with this dataset. A selection of current open-source and proprietary models is evaluated in a uniform zero-shot experimental setting both on the UNED-ACCESS 2024 dataset and on an equivalent subset of MMLU questions. Results show that (i) smaller models not only perform worse than the largest models, but also degrade faster in Spanish than in English: the performance gap between the two languages is negligible for the best models, but grows up to 37% for smaller models; (ii) model ranking on UNED-ACCESS 2024 is almost identical (0.98 Pearson correlation) to the one obtained with MMLU (a similar, but publicly available benchmark), suggesting that contamination affects all models similarly; and (iii) as in publicly available datasets, reasoning questions in UNED-ACCESS are more challenging for models of all sizes.
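As a rough illustration of the Pearson correlation mentioned in result (ii), the following minimal sketch compares per-model accuracies on two benchmarks. The function name `pearson` and the accuracy values are hypothetical and chosen only for illustration; they are not taken from the article, and the authors' actual evaluation code may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accuracies of four models on two benchmarks (made-up numbers).
private_benchmark = [0.91, 0.84, 0.67, 0.52]
public_benchmark = [0.89, 0.85, 0.70, 0.50]
r = pearson(private_benchmark, public_benchmark)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 indicates that models score in nearly the same proportions on both benchmarks.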

2024

The Kronieken Corpus: an Annotated Collection of Dutch/Flemish Chronicles from 1500-1850
Theo Dekker | Erika Kuijpers | Alie Lassche | Carolina Lenarduzzi | Roser Morante | Judith Pollmann
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)

In this paper we present the Kronieken Corpus, a new digital collection of 204 chronicles written in Dutch/Flemish between 1500 and 1850, which have been scanned, transcribed and annotated with named entities, dates and pages; a smaller part has also been annotated with sources and attributions. The texts belong to 308 physical volumes and contain between 23 and 24 million words. 107 chronicles, or 178 chronicle volumes, collected from 39 different archives and libraries in the Netherlands and Belgium and transcribed by volunteers, had never been transcribed or published before. The result is a unique enriched historical text corpus of original hand-written, non-canonical and non-fiction text by lay people from the early modern period.

This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish that would otherwise be scattered across multiple articles and websites. The portal consists of overview pages, where information can be searched and filtered by several criteria, and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023 and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been included.

2022

Leveraging Social Media as a Source for Clinical Guidelines: A Demarcation of Experiential Knowledge
Jia-Zhen Michelle Chan | Florian Kunneman | Roser Morante | Lea Lösch | Teun Zuiderent-Jerak
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

In this paper we present a procedure to extract posts that contain experiential knowledge from Facebook discussions in Dutch, using automated filtering, manual annotations and machine learning. We define guidelines to annotate experiential knowledge and test them on a subset of the data. After several rounds of (re-)annotations, we come to an inter-annotator agreement of K=0.69, which reflects the difficulty of the task. We subsequently discuss inclusion and exclusion criteria to cope with the diversity of manifestations of experiential knowledge relevant to guideline development.
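To make the agreement figure concrete, here is a minimal sketch of how a kappa score can be computed for two annotators. It assumes Cohen's kappa over a single categorical label per post; the abstract does not say which kappa variant was used, and the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations: does a post contain experiential knowledge?
annotator_1 = ["exp", "exp", "other", "other"]
annotator_2 = ["exp", "other", "other", "other"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be substantially lower than the simple proportion of matching labels.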

Identifying Copied Fragments in a 18th Century Dutch Chronicle
Roser Morante | Eleanor L. T. Smith | Lianne Wilhelmus | Alie Lassche | Erika Kuijpers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We apply computational stylometric techniques to an 18th century Dutch chronicle to determine which fragments of the manuscript represent the author’s own original work and which show signs of external source use through either direct copying or paraphrasing. Through stylometric methods the majority of text fragments in the chronicle can be correctly labelled as either the author’s own words, direct copies from sources or paraphrasing. Our results show that clustering text fragments based on stylometric measures is an effective methodology for authorship verification of this document; however, this approach is less effective when personal writing style is masked by author-independent styles or when applied to paraphrased text.

2021

Is Stance Detection Topic-Independent and Cross-topic Generalizable? - A Reproduction Study
Myrthe Reuver | Suzan Verberne | Roser Morante | Antske Fokkens
Proceedings of the 8th Workshop on Argument Mining

Cross-topic stance detection is the task of automatically detecting stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et al., 2019), and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and the specificity of topics in terms of vocabulary and socio-cultural context. We ask: To what extent is stance detection topic-independent and generalizable across topics? We compare the model’s performance on various unseen topics, and find that topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affect the model’s performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection. References: Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and Clustering of Arguments with Contextualized Word Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy. Association for Computational Linguistics.

The Early Modern Dutch Mediascape. Detecting Media Mentions in Chronicles Using Word Embeddings and CRF
Alie Lassche | Roser Morante
Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

While the production of information in the European early modern period is a well-researched topic, the question of how people engaged with the information explosion that occurred in early modern Europe remains underexplored. This paper presents the annotations and experiments aimed at exploring whether we can automatically extract media-related information (source, perception, and receiver) from a corpus of early modern Dutch chronicles in order to gain insight into the mediascape of early modern middle-class people from a historical perspective. In a number of classification experiments with Conditional Random Fields, three categories of features are tested: (i) raw and binary word embedding features, (ii) lexicon features, and (iii) character features. Overall, the classifier that uses raw embeddings performs slightly better. However, given that the best F-scores are around 0.60, we conclude that the machine learning approach needs to be combined with a close reading approach for the results to be useful in answering historical research questions.

2020

Provenance for Linguistic Corpora through Nanopublications
Timo Lek | Anna de Groot | Tobias Kuhn | Roser Morante
Proceedings of the 14th Linguistic Annotation Workshop

Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annotations might have evolved into different versions, making it challenging for researchers to know the data’s provenance. This paper addresses this issue with a case study on event-annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be reliably linked from the start, and thereby be accessed and queried as if they were a single dataset. We describe how such nanopublications can be created and demonstrate how SPARQL queries can be performed to extract interesting content from the new representations. The queries show that information from multiple corpora can be retrieved more easily and effectively because the information from different corpora is represented in a uniform data format.

Annotating Perspectives on Vaccination
Roser Morante | Chantal van Son | Isa Maks | Piek Vossen
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we present the Vaccination Corpus, a corpus of texts related to the online vaccination debate that has been annotated with three layers of information about perspectives: attribution, claims and opinions. Additionally, events related to the vaccination debate are also annotated. The corpus contains 294 documents from the Internet which reflect different views on vaccinations. It has been compiled to study the language of online debates, with the final goal of experimenting with methodologies to extract and contrast perspectives in the framework of the vaccination debate.

Detecting Negation Cues and Scopes in Spanish
Salud María Jiménez-Zafra | Roser Morante | Eduardo Blanco | María Teresa Martín Valdivia | L. Alfonso Ureña López
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this work we address the processing of negation in Spanish. We first present a machine learning system that processes negation in Spanish. Specifically, we focus on two tasks: i) negation cue detection and ii) scope identification. The corpus used in the experimental framework is the SFU Corpus. The results for cue detection outperform state-of-the-art results, whereas for scope detection this is the first system that performs the task for Spanish. Moreover, we provide a qualitative error analysis aimed at understanding the limitations of the system and showing which negation cues and scopes are straightforward to predict automatically, and which ones are challenging.

Corpora Annotated with Negation: An Overview
Salud María Jiménez-Zafra | Roser Morante | María Teresa Martín-Valdivia | L. Alfonso Ureña-López
Computational Linguistics, Volume 46, Issue 1 - March 2020

Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of languages other than English on the Internet, such as Chinese or Spanish, is greater every day. In this study, we present a review of the corpora annotated with negation information in several languages with the goal of evaluating what aspects of negation have been annotated and how compatible the corpora are. We conclude that it is very difficult to merge the existing corpora because we found differences in the annotation schemes used, and most importantly, in the annotation guidelines: the way in which each corpus was tokenized and the negation elements that have been annotated. Unlike other well-established tasks such as semantic role labeling or parsing, negation has no standard annotation scheme or guidelines, which hampers progress in its treatment.

Must Children be Vaccinated or not? Annotating Modal Verbs in the Vaccination Debate
Liza King | Roser Morante
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we analyze the use of modal verbs in a corpus of texts related to the vaccination debate. Broadly speaking, the vaccination debate centers around whether vaccination is safe, and whether it is morally acceptable to enforce mandatory vaccination. In order to successfully intervene and curb the spread of preventable diseases due to low vaccination rates, health practitioners need to be adequately informed on public perception of the safety and necessity of vaccines. Public perception can relate to the strength of conviction that an individual may have towards a proposition (e.g. ‘one must vaccinate’ versus ‘one should vaccinate’), as well as qualify the type of proposition, be it related to morality (‘government should not interfere in my personal choice’) or related to possibility (‘too many vaccines at once could hurt my child’). Text mining and analysis of modal auxiliaries are economically viable means of gaining insights into these perspectives, particularly on a large scale due to the widespread use of social media and blogs as vehicles of communication.
