Ivelina Nikolova

2019

Proceedings of the Student Research Workshop Associated with RANLP 2019
Venelin Kovatchev | Irina Temnikova | Branislava Šandrih | Ivelina Nikolova
Proceedings of the Student Research Workshop Associated with RANLP 2019

2018

Tweety at SemEval-2018 Task 2: Predicting Emojis using Hierarchical Attention Neural Networks and Support Vector Machine
Daniel Kopev | Atanas Atanasov | Dimitrina Zlatkova | Momchil Hardalov | Ivan Koychev | Ivelina Nikolova | Galia Angelova
Proceedings of the 12th International Workshop on Semantic Evaluation

We present the system built for SemEval-2018 Task 2 on Emoji Prediction. Although Twitter messages are very short, we managed to design a wide variety of features: textual, semantic, sentiment-, emotion-, and color-related ones. We investigated different methods of text preprocessing, including replacing text emojis with respective tokens and splitting hashtags to capture more meaning. To represent text, we used word n-grams and word embeddings. We experimented with a wide range of classifiers, and our best results were achieved using an SVM-based classifier and a Hierarchical Attention Neural Network.
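
The abstract does not say how hashtags were split, so the following is only a minimal sketch of one common heuristic: splitting on capitalisation boundaries and digit runs. The function name `split_hashtag` and the example tags are hypothetical, and this is not the authors' preprocessing code.

```python
import re

# Assumed heuristic (not from the paper): an uppercase run that is not
# followed by a lowercase letter, a capitalised or lowercase word, or a
# run of digits each count as one segment.
_SEGMENT = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_hashtag(tag):
    """Split a hashtag such as '#BestDayEver' into its word segments."""
    return _SEGMENT.findall(tag.lstrip("#"))

best_day = split_hashtag("#BestDayEver")
love_nyc = split_hashtag("#LoveNYC")
```

An all-lowercase hashtag comes back as a single segment under this heuristic; splitting those would need a dictionary-based segmenter.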

2017

Identification of Risk Factors in Clinical Texts through Association Rules
Svetla Boytcheva | Ivelina Nikolova | Galia Angelova | Zhivko Angelov
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

We describe a method which extracts Association Rules from texts in order to recognise verbalisations of risk factors. Usually some basic vocabulary about risk factors is known, but medical conditions are expressed in clinical narratives with much higher variety. We propose an approach for data-driven learning of specialised medical vocabulary which, once collected, enables early alerting of potentially affected patients. The method is illustrated by experiments with clinical records of patients with Chronic Obstructive Pulmonary Disease (COPD) and comorbidity of COPD, Diabetes Mellitus and Schizophrenia. Our input data come from the Bulgarian Diabetic Register, which is built using a pseudonymised collection of outpatient records for about 500,000 diabetic patients. The generated Association Rules for COPD are analysed in the context of demographic, gender, and age information. A substantial number of meaningful words signalling risk factors are discovered with high precision and confidence.
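
To make the notion of an association rule concrete, here is a minimal sketch that derives single-word rules (`x -> y`) from bags of words using the standard support and confidence measures. The function `pair_rules`, its thresholds, and the toy records are assumptions made for illustration; they are not the authors' algorithm or data.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return sorted (antecedent, consequent, support, confidence) tuples."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of records containing both words
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)

# Hypothetical toy records; each set stands for the words of one document.
records = [
    {"smoker", "cough", "copd"},
    {"smoker", "copd"},
    {"cough", "diabetes"},
    {"smoker", "cough", "copd"},
]
rules = pair_rules(records, min_support=0.5, min_confidence=0.9)
```

On this toy input, only the pair `copd`/`smoker` passes both thresholds. Real frequent-pattern miners also handle longer itemsets, which this sketch leaves out.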

Proceedings of the Student Research Workshop Associated with RANLP 2017
Venelin Kovatchev | Irina Temnikova | Pepa Gencheva | Yasen Kiprov | Ivelina Nikolova
Proceedings of the Student Research Workshop Associated with RANLP 2017

Mining Association Rules from Clinical Narratives
Svetla Boytcheva | Ivelina Nikolova | Galia Angelova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Shallow text analysis (Text Mining) uses mainly Information Extraction techniques. For low-resource languages, such traditional techniques cannot be applied to big data with sufficient accuracy and recall. In contrast, Data Mining approaches provide an opportunity to perform deep analysis and to discover new knowledge. Frequent pattern mining approaches are used mainly for structured information in databases, and applying them in text mining is quite challenging. Unfortunately, most frequent pattern mining approaches do not use contextual information for extracted patterns: general patterns are extracted regardless of the context. We propose a method that processes raw informal texts (from health discussion forums) and formal texts (outpatient records) in Bulgarian. In addition, we use some context information and small terminological lexicons to generalize extracted frequent patterns. This makes it possible to map informal expressions of medical terminology to formal ones and to generate resources automatically.

2016

Finding Good Answers in Online Forums: Community Question Answering for Bulgarian
Tsvetomila Mihaylova | Ivan Koychev | Preslav Nakov | Ivelina Nikolova
Proceedings of the Second International Conference on Computational Linguistics in Bulgaria (CLIB 2016)

Community Question Answering (CQA) is a form of question answering that has recently become increasingly popular as a research direction. Given a question posted in an online community forum and the thread of answers to it, a common formulation of the task is to automatically rank the answers so that the good ones are ranked higher than the bad ones. Despite the vast research in CQA for English, very little attention has been paid to other languages. To bridge this gap, we present our method for Community Question Answering in Bulgarian. We create annotated training and testing datasets for Bulgarian, and we further explore the applicability of machine translation for reusing English CQA data to build a Bulgarian system. The evaluation results show improvement over the baseline and can serve as a basis for further research.

2015

Voltron: A Hybrid System For Answer Validation Based On Lexical And Distance Features
Ivan Zamanov | Marina Kraeva | Nelly Hateva | Ivana Yovcheva | Ivelina Nikolova | Galia Angelova
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Proceedings of the Student Research Workshop
Irina Temnikova | Ivelina Nikolova | Alexander Popov
Proceedings of the Student Research Workshop

2014

Sublanguage Corpus Analysis Toolkit: A tool for assessing the representativeness and sublanguage characteristics of corpora
Irina Temnikova | William A. Baumgartner Jr. | Negacy D. Hailu | Ivelina Nikolova | Tony McEnery | Adam Kilgarriff | Galia Angelova | K. Bretonnel Cohen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Sublanguages are varieties of language that form subsets of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages, or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.
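
Lexical closure is commonly examined by tracking how many distinct word types have been seen as more tokens are read: a curve that flattens suggests a closed vocabulary. The sketch below illustrates that general idea only. It is not SubCAT's code, and the name `type_growth`, the `step` parameter, and the toy sentence are assumptions.

```python
def type_growth(tokens, step=1000):
    """Return (tokens_read, distinct_types) pairs sampled every `step` tokens."""
    seen = set()
    curve = []
    total = len(tokens)
    for i, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # case-folding is an assumption of this sketch
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve

# Hypothetical toy input; a real corpus would be far larger.
tokens = "the patient has diabetes the patient has copd".split()
curve = type_growth(tokens, step=4)
```

Plotting the second element of each pair against the first gives the growth curve; comparing curves across corpora is one way to contrast a suspected sublanguage with general-language text.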