Todor Primov
2023
Clinical Text Classification to SNOMED CT Codes Using Transformers Trained on Linked Open Medical Ontologies
Anton Hristov | Petar Ivanov | Anna Aksenova | Tsvetan Asamov | Pavlin Gyurov | Todor Primov | Svetla Boytcheva
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
We present an approach for medical text coding with SNOMED CT. Our approach uses publicly available linked open data from terminologies and ontologies as training data for the algorithms. We claim that even small training corpora made of short text snippets can be used to train models for the given task. We propose a method based on transformers enhanced with clustering and filtering of the candidates. Further, we adopt a classical machine learning approach: support vector classification (SVC) using transformer embeddings. The resulting approach proves to be more accurate than the predictions given by Large Language Models. We evaluate on a dataset generated from linked open data for SNOMED CT codes related to morphology and topography for four use cases. Our transformer-based approach achieves an F1-score of 0.82 for morphology and 0.99 for topography codes. Further, we validate the applicability of our approach in a clinical context using labelled real clinical data that are not used for model training.
2021
Application of Deep Learning Methods to SNOMED CT Encoding of Clinical Texts: From Data Collection to Extreme Multi-Label Text-Based Classification
Anton Hristov | Aleksandar Tahchiev | Hristo Papazov | Nikola Tulechki | Todor Primov | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Concept normalization of clinical texts to standard medical classifications and ontologies is a task of high importance for healthcare and medical research. We attempt to solve this problem through automatic SNOMED CT encoding, where SNOMED CT is one of the most widely used and comprehensive clinical term ontologies. Applying basic Deep Learning models, however, leads to undesirable results due to the unbalanced nature of the data and the extreme number of classes. We propose a classification procedure that features a multiple-step workflow consisting of label clustering, multi-cluster classification, and clusters-to-labels mapping. For multi-cluster classification, BioBERT is fine-tuned over our custom dataset. The clusters-to-labels mapping is carried out by a one-vs-all classifier (SVC) applied to every single cluster. We also present the steps for automatic generation of a dataset of textual descriptions annotated with SNOMED CT codes, based on public data and linked open data. To cope with the highly unbalanced dataset, we apply data augmentation methods. The results of the conducted experiments show high accuracy and reliability of our approach for predicting SNOMED CT codes relevant to a clinical text.
Co-authors
- Svetla Boytcheva 2
- Anton Hristov 2
- Anna Aksenova 1
- Tsvetan Asamov 1
- Pavlin Gyurov 1