Medical Word Embeddings for Spanish: Development and Evaluation

Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, Jordi Armengol-Estapé


Abstract
Word embeddings are representations of words in a dense vector space. Although they are not a recent phenomenon in Natural Language Processing (NLP), they have gained momentum with the recent development of neural methods and Word2Vec. In medical and clinical NLP, they are invaluable resources for training in-domain named entity recognition systems, classifiers, or taggers, for instance. Thus, the development of tailored word embeddings for medical NLP is of great interest. However, we identified a gap in the literature which we aim to fill in this paper: the availability of embeddings for medical NLP in Spanish, as well as a standardized form of intrinsic evaluation. Since most work has been done for English, some established datasets for intrinsic evaluation are already available for that language. In this paper, we show the steps we employed to adapt such datasets to Spanish for the first time, which is of particular relevance given the considerable volume of electronic health records (EHRs) in this language. We also describe the creation of in-domain medical word embeddings for Spanish using the state-of-the-art FastText model. We performed intrinsic evaluation with our adapted datasets, as well as extrinsic evaluation with a named entity recognition system, using general-domain embeddings as a baseline. Both experiments showed that our embeddings are suitable for use in medical NLP in the Spanish language, and are more accurate than general-domain ones.
Anthology ID:
W19-1916
Volume:
Proceedings of the 2nd Clinical Natural Language Processing Workshop
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota, USA
Editors:
Anna Rumshisky, Kirk Roberts, Steven Bethard, Tristan Naumann
Venue:
ClinicalNLP
Publisher:
Association for Computational Linguistics
Pages:
124–133
URL:
https://aclanthology.org/W19-1916
DOI:
10.18653/v1/W19-1916
Cite (ACL):
Felipe Soares, Marta Villegas, Aitor Gonzalez-Agirre, Martin Krallinger, and Jordi Armengol-Estapé. 2019. Medical Word Embeddings for Spanish: Development and Evaluation. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 124–133, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Cite (Informal):
Medical Word Embeddings for Spanish: Development and Evaluation (Soares et al., ClinicalNLP 2019)
PDF:
https://aclanthology.org/W19-1916.pdf