Ruslana Margova


2026

The work demonstrates how meaningful rhetorical signals can be isolated from a social media dataset even without pre-labelled data or predefined lexicons. By combining unsupervised mining with linguistic theory and interpretable machine learning, the research offers a scalable approach to understanding how language can shape political perception and behaviour in digital spaces. The study focuses on Bulgarian, a morphologically rich, relatively low-resource language, and produces reusable resources (alert constructions, post-level features, and trained classifiers) that are explicitly designed to support low-resource language modelling, including the training and evaluation of neural language models and LLMs for tasks such as content moderation and propaganda-alert detection. The finding that rhetorical salience, not just topical content, drives engagement has implications beyond Bulgarian: it suggests that how something is said may matter as much as what is said in determining a message’s viral potential and persuasive impact.

2025

Readability assessment plays a crucial role in education and text accessibility. While numerous indices exist for English and have been extended to Romance and Slavic languages, Bulgarian remains underserved in this regard. This paper reviews established readability metrics across these language families, examining their underlying features and modelling methods. We then report the first attempt to develop a readability index for Bulgarian, using end-of-school-year assessment questions and literary works targeted at children of various ages. Key linguistic attributes, namely word length, sentence length, syllable count, and information content (based on word frequency), were extracted, and their first two statistical moments, mean and variance, were modelled against grade levels using linear and polynomial regression. Results suggest that polynomial models outperform linear ones by capturing non-linear relationships between textual features and perceived difficulty, but may be harder to interpret. This work provides an initial framework for building a reliable readability measure for Bulgarian, with applications in educational text design, adaptive learning, and corpus annotation.
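The linear-vs-polynomial comparison at the heart of this abstract can be sketched as follows. All feature values below are made up for illustration; the actual features, corpus, and grade mapping are those of the paper, not this sketch.

```python
import math

# Illustrative data only: mean word length per text vs. the grade level of
# the cohort the text targets (not the paper's actual measurements).
x = [3.8, 4.1, 4.4, 4.9, 5.3, 5.8, 6.1, 6.5]
y = [1.0, 1.6, 2.8, 4.4, 5.4, 6.6, 7.2, 8.0]

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations
    (V^T V) c = V^T y for the Vandermonde matrix V, solved by Gaussian
    elimination. Returns coefficients for c[0] + c[1]*x + c[2]*x^2 + ..."""
    n = degree + 1
    A = [[sum(xi ** (i + j) for xi in xs) for j in range(n)] for i in range(n)]
    b = [sum((xi ** i) * yi for xi, yi in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # forward elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        coeffs[r] = (b[r] - sum(A[r][c] * coeffs[c]
                                for c in range(r + 1, n))) / A[r][r]
    return coeffs

def rmse(coeffs, xs, ys):
    """Root-mean-square error of the fitted polynomial on (xs, ys)."""
    preds = [sum(c * xi ** i for i, c in enumerate(coeffs)) for xi in xs]
    return math.sqrt(sum((p - yi) ** 2 for p, yi in zip(preds, ys)) / len(ys))

lin, quad = fit_poly(x, y, 1), fit_poly(x, y, 2)
print(f"linear RMSE: {rmse(lin, x, y):.3f}  quadratic RMSE: {rmse(quad, x, y):.3f}")
```

Because the linear model is nested inside the quadratic one, the quadratic fit can never have a higher training error; the paper's point is that on real readability data the gain is large enough to matter, at some cost in interpretability.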

2024

In this study we identify the most frequently used words and some multi-word expressions in the Bulgarian Parliament. We do this by using the transcripts of all plenary sessions between 1990 and 2024 (3,936 sessions in total). This allows us both to study an interesting period known in the Bulgarian linguistic space as the years of “transition and democracy”, and to provide scholars of Bulgarian politics with a purposefully generated list of additional stop words that they can use for future analysis. Because our list of words was generated from the data, there is no preconceived theory, and because we include all interactions during all sessions, our analysis goes beyond traditional party lines. We provide details of how we selected, retrieved, and cleaned our data, and discuss our findings.
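The data-driven core of this approach is corpus-wide frequency ranking. A minimal sketch with toy stand-ins for the transcripts (the real corpus is 3,936 plenary-session transcripts, and the paper's selection and cleaning steps are not reproduced here):

```python
import re
from collections import Counter

# Toy stand-ins for parliamentary transcripts; illustrative phrases only.
transcripts = [
    "уважаеми колеги предлагам да гласуваме",
    "колеги моля да гласуваме предложението",
    "уважаеми народни представители да гласуваме",
]

def top_words(texts, n=3):
    """Rank words by raw corpus frequency; the most frequent ones become
    candidates for a domain-specific stop word list."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return [w for w, _ in counts.most_common(n)]

print(top_words(transcripts))  # high-frequency words such as 'да', 'гласуваме'
```

On real transcripts this surfaces procedural vocabulary ("colleagues", "I propose to vote") that is uninformative for topic analysis, which is precisely why a parliament-specific stop word list is useful on top of a general-purpose one.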
This article introduces SM-FEEL-BG – the first Bulgarian-language package, containing 6 datasets with Social Media (SM) texts with emotion, feeling, and sentiment labels and 4 classifiers trained on them. All but one of these datasets are freely accessible for research purposes. The largest dataset contains 6,000 Twitter, Telegram, and Facebook texts, manually annotated with 21 fine-grained emotion/feeling categories. The fine-grained labels are automatically merged into three coarse-grained sentiment categories, producing a dataset with two parallel sets of labels. Several classification experiments are run on different subsets of the fine-grained categories and their respective sentiment labels with a Bulgarian fine-tuned BERT. The highest accuracy reached was 0.61 for 16 emotions and 0.70 for 11 emotions (including 310 ChatGPT-4-generated texts). The sentiment accuracy on the 11-emotion dataset was also the highest (0.79). As Facebook posts cannot be shared, we ran experiments on the Twitter and Telegram subset of the 11-emotion dataset, obtaining an accuracy of 0.73 for emotions and 0.80 for sentiments. The article describes the annotation procedures, guidelines, experiments, and results. We believe that this package will be of significant benefit to researchers working on emotion detection and sentiment analysis in Bulgarian.
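The automatic merge of fine-grained labels into coarse sentiment amounts to a many-to-one projection. The mapping below is a hypothetical excerpt for illustration; the actual 21-label inventory and its merge rules are defined in the dataset's annotation guidelines.

```python
# Hypothetical excerpt of a fine-grained-to-coarse label mapping; the real
# SM-FEEL-BG inventory has 21 categories and its own merge rules.
EMOTION_TO_SENTIMENT = {
    "joy": "positive", "trust": "positive", "gratitude": "positive",
    "anger": "negative", "fear": "negative", "disgust": "negative",
    "surprise": "neutral", "neutral": "neutral",
}

def coarsen(labels):
    """Project fine-grained emotion labels onto the three sentiment
    categories, producing the second, parallel set of labels."""
    return [EMOTION_TO_SENTIMENT[label] for label in labels]

print(coarsen(["joy", "anger", "surprise"]))
# ['positive', 'negative', 'neutral']
```

Keeping both label sets in parallel lets the same annotated texts serve two tasks: fine-grained emotion classification and coarse sentiment analysis.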

2023

Textual deepfakes can cause harm, especially on social media. At the moment, there are models trained to detect deepfake messages mainly for English, but no research or datasets currently exist for detecting them in most low-resource languages, such as Bulgarian. To address this gap, we explore three approaches. First, we machine translate an English-language social media dataset with bot messages into Bulgarian. However, the translation quality is unsatisfactory, leading us to create a new Bulgarian-language dataset with real social media messages and those generated by two language models (a new Bulgarian GPT-2 model – GPT-WEB-BG, and ChatGPT). We machine translate it into English and test existing English GPT-2 and ChatGPT detectors on it, achieving only 0.44-0.51 accuracy. Next, we train our own classifiers on the Bulgarian dataset, obtaining an accuracy of 0.97. Additionally, we apply the classifier with the highest results to a recently released Bulgarian social media dataset with manually fact-checked messages, which successfully identifies some of the messages as generated by Language Models (LM). Our results show that the use of machine translation is not suitable for textual deepfake detection. We conclude that combining LM text detection with fact-checking is the most appropriate method for this task, and that identifying Bulgarian textual deepfakes is indeed possible.
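At its core, the detection task is binary classification of human-written vs. LM-generated text. A minimal baseline sketch, assuming a toy corpus of placeholder strings and a simple Naive Bayes model (the paper's own classifiers and dataset are not reproduced here):

```python
import math
from collections import Counter

# Tiny illustrative corpus: label 1 = LM-generated, 0 = human-written.
# The paper pairs real social media posts with GPT-WEB-BG / ChatGPT
# generations; these strings are placeholders, not real dataset items.
train = [
    ("as an ai language model i cannot", 1),
    ("in conclusion it is important to note", 1),
    ("lol that match was insane last night", 0),
    ("anyone know a good plumber in sofia", 0),
]

def fit(data):
    """Collect per-class unigram counts for multinomial Naive Bayes."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in data:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counter in word_counts.values() for w in counter}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Log-space Naive Bayes with add-one smoothing; returns 0 or 1."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        n_label = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (n_label + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "as an ai model i note that"))  # 1 (generated)
```

Real detectors (including the BERT-style classifiers used in the paper) replace the bag-of-words features with contextual representations, but the train-on-in-language-data lesson is the same: the 0.97 in-language accuracy vs. 0.44-0.51 after machine translation shows how much signal translation destroys.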

2009