Jordi Armengol-Estapé


2022

pdf bib
Pretrained Biomedical Language Models for Clinical NLP in Spanish
Casimiro Pio Carrino | Joan Llop | Marc Pàmies | Asier Gutiérrez-Fandiño | Jordi Armengol-Estapé | Joaquín Silveira-Ocampo | Alfonso Valencia | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the 21st Workshop on Biomedical Language Processing

This work presents the first large-scale biomedical Spanish language models trained from scratch, using large biomedical corpora consisting of a total of 1.1B tokens and an EHR corpus of 95M tokens. We compared them against general-domain and other domain-specific models for Spanish on three clinical NER tasks. As main results, our models are superior across the NER tasks, rendering them more convenient for clinical NLP applications. Furthermore, our findings indicate that when enough data is available, pre-training from scratch is better than continual pre-training when tested on clinical tasks, raising an exciting research question about which approach is optimal. Our models and fine-tuning scripts are publicly available at HuggingFace and GitHub.

2021

pdf bib
Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Ksenia Kharitonova | Ona de Gibert Bonet | Jordi Armengol-Estapé | Mar Rodriguez i Alvarez | Maite Melero
Proceedings of the Sixth Conference on Machine Translation

This paper describes the participation of the BSC team in the WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The system aims to solve the Subtask 2: Wikipedia cultural heritage articles, which involves translation in four Romance languages: Catalan, Italian, Occitan and Romanian. The submitted system is a multilingual semi-supervised machine translation model. It is based on a pre-trained language model, namely XLM-RoBERTa, that is later fine-tuned with parallel data obtained mostly from OPUS. Unlike other works, we only use XLM to initialize the encoder and randomly initialize a shallow decoder. The reported results are robust and perform well for all tested languages.

pdf bib
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation
Jordi Armengol-Estapé | Marta R. Costa-jussà | Carlos Escolano
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it allows to introduce external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which includes both extremely low-resourced and very distant languages, and obtain an improvement of 1.2 BLEU

2019

pdf bib
Medical Word Embeddings for Spanish: Development and Evaluation
Felipe Soares | Marta Villegas | Aitor Gonzalez-Agirre | Martin Krallinger | Jordi Armengol-Estapé
Proceedings of the 2nd Clinical Natural Language Processing Workshop

Word embeddings are representations of words in a dense vector space. Although they are not recent phenomena in Natural Language Processing (NLP), they have gained momentum after the recent developments of neural methods and Word2Vec. Regarding their applications in medical and clinical NLP, they are invaluable resources when training in-domain named entity recognition systems, classifiers or taggers, for instance. Thus, the development of tailored word embeddings for medical NLP is of great interest. However, we identified a gap in the literature which we aim to fill in this paper: the availability of embeddings for medical NLP in Spanish, as well as a standardized form of intrinsic evaluation. Since most work has been done for English, some established datasets for intrinsic evaluation are already available. In this paper, we show the steps we employed to adapt such datasets for the first time to Spanish, of particular relevance due to the considerable volume of EHRs in this language, as well as the creation of in-domain medical word embeddings for the Spanish using the state-of-the-art FastText model. We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition systems using a baseline embedding of general-domain. Both experiments proved that our embeddings are suitable for use in medical NLP in the Spanish language, and are more accurate than general-domain ones.