Victoria Bobicev

2018

Using PPM for Health Related Text Detection
Victoria Bobicev | Victoria Lazu | Daniela Istrati
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper describes the participation of the LILU team in the SMM4H challenge on social media mining for descriptions of health-related events such as drug intake or vaccination.

Thumbs Up and Down: Sentiment Analysis of Medical Online Forums
Victoria Bobicev | Marina Sokolova
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

In the current study, we apply multi-class and multi-label sentence classification to sentiment analysis of online medical forums. We aim to identify major health issues discussed in online social media and the types of sentiments those issues evoke. We use an ontology of personal health information for Information Extraction and apply Machine Learning methods to the automated recognition of the expressed sentiments.

2017

Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language
Victoria Bobicev | Cătălina Mărănduc | Cenel Augusto Perez
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe

Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those existing or in progress deal only with the contemporary Romanian standard. However, the need to study the dynamics of natural languages gave rise to balanced corpora containing non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a large balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus of Old Romanian. We also intend to include the South-Danube dialects, which are remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, together with the mutual understanding between speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.

Syntactic Semantic Correspondence in Dependency Grammar
Cătălina Mărănduc | Cătălin Mititelu | Victoria Bobicev
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories

Inter-Annotator Agreement in Sentiment Analysis: Machine Learning Perspective
Victoria Bobicev | Marina Sokolova
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Manual text annotation is an essential part of Big Text analytics. Although annotators work with limited parts of data sets, their results are extrapolated by automated text classification and affect the final classification results. Reliability of annotations and adequacy of assigned labels are especially important in the case of sentiment annotations. In the current study we examine inter-annotator agreement in multi-class, multi-label sentiment annotation of messages. We used several annotation agreement measures, as well as statistical analysis and Machine Learning to assess the resulting annotations.
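
The abstract does not name the agreement measures that were used. As an illustration only, one widely used measure for two annotators is Cohen's kappa, sketched below. The sketch assumes a single label per message, which is simpler than the multi-label setting of the paper, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who each assign one label per item.

    Illustrative sketch only: single-label case, not the paper's procedure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: share of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: computed from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa is 1 for perfect agreement and 0 when the observed agreement equals what the annotators' label frequencies would produce by chance.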

2016

Automatic Detection of Arabicized Berber and Arabic Varieties
Wafia Adouane | Nasredine Semmar | Richard Johansson | Victoria Bobicev
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is a necessary first step in any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, many languages are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of a pronunciation-based written format for these spoken languages. These languages are not well represented on the Web and are commonly referred to as under-resourced languages, and the currently available ALI tools fail to recognize them properly. In this paper, we revisit the problem of ALI with a focus on short texts in Arabicized Berber and dialectal Arabic. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited to detecting Arabicized Berber and different Arabic varieties and to distinguishing between them, giving a macro-average F-score of 92.94%.
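
The character n-gram approach mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the system evaluated in the paper: it assumes character trigrams with add-one smoothing, and the function names and the toy English and French training sentences are invented for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language label.

    `samples` maps a label to a list of training texts.
    """
    return {
        label: Counter(g for text in texts for g in char_ngrams(text, n))
        for label, texts in samples.items()
    }


def identify(text, models, n=3):
    """Return the label whose model gives the text the highest log-likelihood."""
    # Shared vocabulary size (plus one slot for unseen n-grams) for smoothing.
    vocab_size = len(set().union(*models.values())) + 1
    best_label, best_score = None, -math.inf
    for label, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothed log-probability of each n-gram in the input.
        score = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration; real identifiers train on large corpora.
toy_samples = {
    "en": ["the cat is on the table", "this is the house"],
    "fr": ["le chat est sur la table", "ceci est la maison"],
}
toy_models = train(toy_samples)
```

Calling `identify("la maison est", toy_models)` scores the input against each toy model. With so little training data the result is only indicative, which mirrors the abstract's point that under-resourced varieties are hard for n-gram identifiers trained on scarce text.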

The paper describes a method for estimating word phonosemantics. We treat phonosemantics as the subconscious emotional perception of how a word sounds, independent of the word's meaning. The method is based on data about the emotional perception of sounds, obtained from a number of respondents. A program estimates the emotional characteristics of words using these data about sounds. The program output was compared with human judgment. The experiments showed that in most cases the computer description of a word based on phonosemantic calculations is similar to our own impressions of how the word sounds. On the other hand, word meaning dominates the emotional perception of a word, and the phonosemantic component emerges mainly for words whose meaning is unknown.