2024
SINAI at BioLaySumm: Self-Play Fine-Tuning of Large Language Models for Biomedical Lay Summarisation
Mariia Chizhikova | Manuel Carlos Díaz-Galiano | L. Alfonso Ureña-López | María-Teresa Martín-Valdivia
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing
Effective communication of scientific knowledge and advances to the general public is often hindered by the complexity of the technical language used in research, which is often very difficult, if not impossible, for non-experts to understand. In this paper we present the approach developed by the SINAI team for the BioLaySumm shared task hosted by the BioNLP workshop at ACL 2024. Our approach stems from experiments testing the ability of state-of-the-art pre-trained large language models, namely GPT 3.5, GPT 4 and Llama-3, to tackle this task in a few-shot manner. To improve on this baseline, we fine-tuned Llama-3 using parameter-efficient methods. The best-performing system resulted from applying self-play fine-tuning, a method in which the model improves by learning to distinguish its own generations from the previous step from the gold-standard summaries. This approach achieved a ROUGE-1 score of 0.4205 and a BERTScore of 0.8583.
2020
Transfer learning applied to text classification in Spanish radiological reports
Pilar López Úbeda | Manuel Carlos Díaz-Galiano | L. Alfonso Urena Lopez | Maite Martin | Teodoro Martín-Noguerol | Antonio Luna
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
Pre-trained text encoders have rapidly advanced the state of the art on many Natural Language Processing tasks. This paper presents the use of transfer learning methods for the automatic detection of codes in radiological reports in Spanish. Assigning codes to a clinical document is a popular task in NLP and in the biomedical domain. These codes can be of two types: standard classifications (e.g. ICD-10) or codes specific to each clinic or hospital. In this study we present a system that uses the specific codes of a radiology clinic. The dataset is composed of 208,167 radiology reports labeled with 89 different codes. The corpus has been evaluated with three BERT-based models applicable to Spanish: Multilingual BERT, BETO and XLM. The results are interesting: a pre-trained multilingual model obtains an F1-score of 70%.
2019
SINAI-DL at SemEval-2019 Task 5: Recurrent networks and data augmentation by paraphrasing
Arturo Montejo-Ráez | Salud María Jiménez-Zafra | Miguel A. García-Cumbreras | Manuel Carlos Díaz-Galiano
Proceedings of the 13th International Workshop on Semantic Evaluation
This paper describes the participation of the SINAI-DL team in Task 5 of SemEval 2019, called HatEval. We have applied some classic neural network layers, such as word embeddings and LSTM, to build a neural classifier for both proposed tasks. Because the amount of training data provided is small compared to what an adequate learning stage in deep architectures requires, we explore the use of paraphrasing tools as a source of data augmentation. Our results show that this method is promising, as some improvement has been found over non-augmented training sets.
SINAI-DL at SemEval-2019 Task 7: Data Augmentation and Temporal Expressions
Miguel A. García-Cumbreras | Salud María Jiménez-Zafra | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano | Estela Saquete
Proceedings of the 13th International Workshop on Semantic Evaluation
This paper describes the participation of the SINAI-DL team in RumourEval (Task 7 in SemEval 2019, subtask A: SDQC). SDQC addresses the challenge of rumour stance classification as an indirect way of identifying potential rumours. Given a tweet with several replies, our system classifies each reply as supporting, denying, questioning or commenting on the underlying rumour. We have applied data augmentation, temporal expression labelling and transfer learning with a four-layer neural classifier. Our official run achieves an accuracy of 0.715 over reply tweets.
Using Snomed to recognize and index chemical and drug mentions.
Pilar López Úbeda | Manuel Carlos Díaz Galiano | L. Alfonso Urena Lopez | Maite Martin
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
In this paper we describe a new named entity extraction system. Our work proposes a system for the identification and annotation of drug names in Spanish biomedical texts based on machine learning and deep learning models. Subsequently, a standardized Snomed code is assigned to each of these drugs. For this purpose, Natural Language Processing tools and techniques have been used, and a dictionary has been built from different sources of information. The results are promising: we obtain an F1 score of 78% on the first sub-track, and in the second task we correctly map 72% of the entities found to Snomed.
Using Machine Learning and Deep Learning Methods to Find Mentions of Adverse Drug Reactions in Social Media
Pilar López Úbeda | Manuel Carlos Díaz Galiano | Maite Martin | L. Alfonso Urena Lopez
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task
Social networks are becoming increasingly popular platforms for sharing health-related information. Social Media Mining for Health Applications (SMM4H) provides tasks such as those described in this document to help manage information in the health domain. This document presents the first participation of the SINAI group. We study approaches based on machine learning and deep learning to extract adverse drug reaction mentions from Twitter. The results obtained in the tasks are encouraging: we are close to the average of all participants and even above it in some cases.
Detecting Anorexia in Spanish Tweets
Pilar López Úbeda | Flor Miriam Plaza del Arco | Manuel Carlos Díaz Galiano | L. Alfonso Urena Lopez | Maite Martin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Mental health is one of the main concerns of today’s society. Early detection of symptoms can greatly help people with mental disorders. People are using social networks more and more to express emotions, sentiments and mental states. Thus, processing this information with NLP technologies can support the automatic detection of mental problems such as eating disorders. However, the first step towards solving the problem should be to provide a corpus on which to evaluate our systems. In this paper, we specifically focus on detecting anorexia messages on Twitter. First, we have generated a new corpus of tweets in Spanish, extracted from different accounts and including both anorexia and non-anorexia messages. The corpus is called SAD: Spanish Anorexia Detection corpus. In order to validate the effectiveness of the SAD corpus, we also propose several machine learning approaches for automatically detecting anorexia symptoms in the corpus. The good results obtained show that applying textual classification methods is a promising option for developing this kind of system, and that these tools could be used by professionals to help in the early detection of mental problems.
2016
Pictogrammar: an AAC device based on a semantic grammar
Fernando Martínez-Santiago | Miguel Ángel García-Cumbreras | Arturo Montejo-Ráez | Manuel Carlos Díaz-Galiano
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications