Jerónimo Arenas-García

Also published as: Jerónimo Arenas-garcía


2023

ITMT: Interactive Topic Model Trainer
Lorena Calvo Bartolomé | José Antonio Espinosa Melchor | Jerónimo Arenas-garcía
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Topic modeling is a commonly used technique for analyzing unstructured data in various fields, but achieving accurate results and useful models can be challenging, especially for domain experts who lack the knowledge needed to optimize the parameters required by this natural language processing technique. From this perspective, we introduce an Interactive Topic Model Trainer (ITMT) developed within the EU-funded project IntelComp. ITMT is a user-in-the-loop topic modeling tool with a graphical user interface that supports training and curating topic models with different state-of-the-art topic extraction libraries, including some recent neural-based methods, and is aimed at domain experts. This paper reviews ITMT’s functionalities and key implementation aspects, including a comparison with other tools for topic modeling analysis.

2020

Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
Doaa Samy | David Pérez-Fernández | Jerónimo Arenas-García
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

Legal-ES: A Set of Large Scale Resources for Spanish Legal Text Processing
Doaa Samy | Jerónimo Arenas-García | David Pérez-Fernández
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

Legal-ES is an open-source resource kit for legal Spanish. It consists of a large-scale Spanish corpus of open legal texts and different kinds of language models, including word embeddings and topic models. The corpus includes over 1000 million words covering a collection of legislative and administrative open-access documents in Spanish from different sources representing international, national and regional entities. The corpus is pre-processed and tokenized using spaCy. For the word embeddings, gensim was used on the collection of tokens, producing a representation space that is especially suited to reflect the inherent characteristics of the legal domain. We also compute topic models to obtain a convenient tool for understanding the main topics in the corpus and for navigating through the documents by exploiting the semantic similarity among them. We will analyse the time structure of a dynamic topic model to infer changes in the legal production of Spanish jurisdiction that have occurred over the analysed time frame.