David Pérez-Fernández

Also published as: David Perez Fernandez, David Pérez Fernández


2025

pdf bib
CASE: Large Scale Topic Exploitation for Decision Support Systems
Lorena Calvo Bartolomé | Jerónimo Arenas-García | David Pérez Fernández
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

In recent years, there has been growing interest in using NLP tools for decision support systems, particularly in Science, Technology, and Innovation (STI). Among these, topic modeling has been widely used for analyzing large document collections, such as scientific articles, research projects, or patents, yet its integration into decision-making systems remains limited. This paper introduces CASE, a tool for exploiting topic information for semantic analysis of large corpora. The core of CASE is a Solr engine with a customized indexing strategy to represent information from Bayesian and Neural topic models that allow efficient topic-enriched searches. Through ad-hoc plug-ins, CASE enables topic inference on new texts and semantic search. We demonstrate the versatility and scalability of CASE through two use cases: the calculation of aggregated STI indicators and the implementation of a web service to help evaluate research projects.

2020

pdf bib
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
Doaa Samy | David Pérez-Fernández | Jerónimo Arenas-García
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

pdf bib
Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing
Doaa Samy | Jerónimo Arenas-García | David Pérez-Fernández
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

Legal-ES is an open source resource kit for legal Spanish. It consists of a large scale Spanish corpus of open legal texts and different kinds of language models including word embeddings and topic models. The corpus includes over 1000 million words covering a collection of legislative and administrative open access documents in Spanish from different sources representing international, national and regional entities. The corpus is pre-processed and tokenized using Spacy. For the word embeddings, gensim was used on the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We calculate also topic models to obtain a convenient tool to understand the main topics in the corpus and to navigate through the documents exploiting the semantic similarity among documents. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have occurred over the analysed time framework.

2018

pdf bib
ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | Paulo Vale | José Luis Fonseca | Teresa Lynn | Jane Dunne | Federico Gaspari | Andy Way | Victoria Arranz | Khalid Choukri | Vladimir Popescu | Pedro Neiva | Rui Neto | Maite Melero | David Perez Fernandez | Antonio Branco | Ruben Branco | Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.