Sylvia Vassileva

2025

Using LLMs for Multilingual Clinical Entity Linking to ICD-10
Sylvia Vassileva | Ivan K. Koychev | Svetla Boytcheva
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 on subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.

pdf bib abs

ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities
Aleksis Ioannis Datseris | Sylvia Vassileva | Ivan K. Koychev | Svetla Boytcheva
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

This paper introduces a novel approach to position embeddings in transformer models, named “Exact Positional Embeddings” (ExPE). An absolute positional embedding method that can extrapolate to sequences of lengths longer than the ones it was trained on. Traditional transformer models rely on absolute or relative position embeddings to incorporate positional information into token embeddings, which often struggle with extrapolation to sequences longer than those seen during training. Our proposed method utilizes a novel embedding strategy that encodes exact positional information by overriding specific dimensions of the embedding vectors, thereby enabling a more precise representation of token positions. The proposed approach not only maintains the integrity of the original embeddings but also enhances the model’s ability to generalize to longer sequences. In causal language modeling, our ExPE embeddings significantly reduce perplexity compared to rotary and sinusoidal embeddings, when tested on sequences longer than those used in training. The code and supplementary materials can be found in

2023

pdf bib abs

FMI-SU at SemEval-2023 Task 7: Two-level Entailment Classification of Clinical Trials Enhanced by Contextual Data Augmentation
Sylvia Vassileva | Georgi Grazhdanski | Svetla Boytcheva | Ivan Koychev
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The paper presents an approach for solving SemEval 2023 Task 7 - identifying the inference relation in a clinical trials dataset. The system has two levels for retrieving relevant clinical trial evidence for a statement and then classifying the inference relation based on the relevant sentences. In the first level, the system classifies the evidence-statement pairs as relevant or not using a BERT-based classifier and contextual data augmentation (subtask 2). Using the relevant parts of the clinical trial from the first level, the system uses an additional BERT-based classifier to determine whether the relation is entailment or contradiction (subtask 1). In both levels, the contextual data augmentation is showing a significant improvement in the F1 score on the test set of 3.7% for subtask 2 and 7.6% for subtask 1, achieving final F1 scores of 82.7% for subtask 2 and 64.4% for subtask 1.

2021

pdf bib abs

Vast amounts of data in healthcare are available in unstructured text format, usually in the local language of the countries. These documents contain valuable information. Secondary use of clinical narratives and information extraction of key facts and relations from them about the patient disease history can foster preventive medicine and improve healthcare. In this paper, we propose a hybrid method for the automatic transformation of clinical text into a structured format. The documents are automatically sectioned into the following parts: diagnosis, patient history, patient status, lab results. For the “Diagnosis” section a deep learning text-based encoding into ICD-10 codes is applied using MBG-ClinicalBERT - a fine-tuned ClinicalBERT model for Bulgarian medical text. From the “Patient History” section, we identify patient symptoms using a rule-based approach enhanced with similarity search based on MBG-ClinicalBERT word embeddings. We also identify symptom relations like negation. For the “Patient Status” description, binary classification is used to determine the status of each anatomic organ. In this paper, we demonstrate different methods for adapting NLP tools for English and other languages to a low resource language like Bulgarian.

pdf bib abs

Comparative Analysis of Fine-tuned Deep Learning Language Models for ICD-10 Classification Task for Bulgarian Language
Boris Velichkov | Sylvia Vassileva | Simeon Gerginov | Boris Kraychev | Ivaylo Ivanov | Philip Ivanov | Ivan Koychev | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

The task of automatic diagnosis encoding into standard medical classifications and ontologies, is of great importance in medicine - both to support the daily tasks of physicians in the preparation and reporting of clinical documentation, and for automatic processing of clinical reports. In this paper we investigate the application and performance of different deep learning transformers for automatic encoding in ICD-10 of clinical texts in Bulgarian. The comparative analysis attempts to find which approach is more efficient to be used for fine-tuning of pretrained BERT family transformer to deal with a specific domain terminology on a rare language as Bulgarian. On the one side are used SlavicBERT and MultiligualBERT, that are pretrained for common vocabulary in Bulgarian, but lack medical terminology. On the other hand in the analysis are used BioBERT, ClinicalBERT, SapBERT, BlueBERT, that are pretrained for medical terminology in English, but lack training for language models in Bulgarian, and more over for vocabulary in Cyrillic. In our research study all BERT models are fine-tuned with additional medical texts in Bulgarian and then applied to the classification task for encoding medical diagnoses in Bulgarian into ICD-10 codes. Big corpora of diagnosis in Bulgarian annotated with ICD-10 codes is used for the classification task. Such an analysis gives a good idea of which of the models would be suitable for tasks of a similar type and domain. The experiments and evaluation results show that both approaches have comparable accuracy.

Co-authors

Aleksis Ioannis Datseris 1

Venues

RANLP4
SemEval1

Fix author