Nikola Tulechki


2021

Application of Deep Learning Methods to SNOMED CT Encoding of Clinical Texts: From Data Collection to Extreme Multi-Label Text-Based Classification
Anton Hristov | Aleksandar Tahchiev | Hristo Papazov | Nikola Tulechki | Todor Primov | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Concept normalization of clinical texts to standard medical classifications and ontologies is a task of high importance for healthcare and medical research. We attempt to solve this problem through automatic SNOMED CT encoding, where SNOMED CT is one of the most widely used and comprehensive clinical term ontologies. Applying basic Deep Learning models, however, leads to undesirable results due to the unbalanced nature of the data and the extreme number of classes. We propose a classification procedure built as a multi-step workflow consisting of label clustering, multi-cluster classification, and clusters-to-labels mapping. For multi-cluster classification, BioBERT is fine-tuned on our custom dataset. The clusters-to-labels mapping is carried out by a one-vs-all classifier (SVC) applied to each cluster. We also present the steps for automatically generating a dataset of textual descriptions annotated with SNOMED CT codes from public data and linked open data. To cope with the highly unbalanced dataset, we apply data augmentation methods. The results of the conducted experiments show high accuracy and reliability of our approach for predicting SNOMED CT codes relevant to a clinical text.

2019

Comparison of Machine Learning Approaches for Industry Classification Based on Textual Descriptions of Companies
Andrey Tagarev | Nikola Tulechki | Svetla Boytcheva
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

This paper addresses the task of categorizing companies within industry classification schemes. The dataset consists of encyclopedic articles about companies and their economic activities. The target classification schema is built by mapping linked open data in a semi-supervised manner. Target classes are built bottom-up from DBpedia. We apply several state-of-the-art text classification techniques, based on both deep learning and classical vector-space models.

2013

Second order similarity for exploring multilingual textual databases (Similarité de second ordre pour l’exploration de bases textuelles multilingues) [in French]
Nikola Tulechki | Ludovic Tanguy
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Effacement de dimensions de similarité textuelle pour l’exploration de collections de rapports d’incidents aéronautiques (Deletion of dimensions of textual similarity for the exploration of collections of accident reports in aviation) [in French]
Nikola Tulechki | Ludovic Tanguy
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

2011

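Des outils de TAL en support aux experts de sûreté industrielle pour l’exploitation de bases de données de retour d’expérience (NLP tools to support industrial safety experts in exploiting experience feedback databases) [in French]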
Nikola Tulechki
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

This article presents applications of natural language processing (NLP) tools and methods to industrial risk management through the analysis of textual data drawn from large experience feedback (REX, "retour d’expérience") databases. It first describes the field of safety management, its political and social aspects, the work of safety experts, and the needs they express. It then presents a set of techniques, such as automatic document classification, subjectivity detection, and clustering, adapted to REX data and delivered as tools that support the experts' work in meeting these present and future needs.