Nicolau Duran-Silva
2024
AffilGood: Building reliable institution name disambiguation tools to improve scientific literature analysis
Nicolau Duran-Silva
|
Pablo Accuosto
|
Piotr Przybyła
|
Horacio Saggion
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
The accurate attribution of scientific works to research organizations is hindered by the lack of openly available manually annotated data–in particular when multilingual and complex affiliation strings are considered. The AffilGood framework introduced in this paper addresses this gap. We identify three sub-tasks relevant for institution name disambiguation and make available annotated datasets and tools aimed at each of them, including i) a dataset annotated with affiliation spans in noisy automatically-extracted strings; ii) a dataset annotated with named entities for the identification of organizations and their locations; iii) seven datasets annotated with the Research Organization Registry (ROR) identifiers for the evaluation of entity-linking systems. In addition, we describe, evaluate and make available newly developed tools that use these datasets to provide solutions for each of the identified sub-tasks. Our results confirm the value of the developed resources and methods in addressing key challenges in institution name disambiguation.
2023
A weakly supervised textual entailment approach to zero-shot text classification
Marc Pàmies
|
Joan Llop
|
Francesco Multari
|
Nicolau Duran-Silva
|
César Parra-Rojas
|
Aitor Gonzalez-Agirre
|
Francesco Alessandro Massucci
|
Marta Villegas
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Zero-shot text classification is a widely studied task that deals with a lack of annotated data. The most common approach is to reformulate it as a textual entailment problem, enabling classification into unseen classes. This work explores an effective approach that trains on a weakly supervised dataset generated from traditional classification data. We empirically study the relation between the performance of the entailment task, which is used as a proxy, and the target zero-shot text classification task. Our findings reveal that there is no linear correlation between both tasks, to the extent that it can be detrimental to lengthen the fine-tuning process even when the model is still learning, and propose a straightforward method to stop training on time. As a proof of concept, we introduce a domain-specific zero-shot text classifier that was trained on Microsoft Academic Graph data. The model, called SCIroShot, achieves state-of-the-art performance in the scientific domain and competitive results in other areas. Both the model and evaluation benchmark are publicly available on HuggingFace and GitHub.