2025
Towards Creating a Bulgarian Readability Index
Dimitar Kazakov | Stefan Minkov | Ruslana Margova | Irina Temnikova | Ivo Emauilov
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Readability assessment plays a crucial role in education and text accessibility. While numerous indices exist for English and have been extended to Romance and Slavic languages, Bulgarian remains underserved in this regard. This paper reviews established readability metrics across these language families, examining their underlying features and modelling methods. We then report the first attempt to develop a readability index for Bulgarian, using end-of-school-year assessment questions and literary works targeted at children of various ages. Key linguistic attributes, namely word length, sentence length, syllable count, and information content (based on word frequency), were extracted, and their first two statistical moments, mean and variance, were modelled against grade levels using linear and polynomial regression. Results suggest that polynomial models outperform linear ones by capturing non-linear relationships between textual features and perceived difficulty, but may be harder to interpret. This work provides an initial framework for building a reliable readability measure for Bulgarian, with applications in educational text design, adaptive learning, and corpus annotation.
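As a rough illustration of the modelling step described above, the sketch below regresses the mean and variance of simple text features against grade levels with both linear and polynomial models. The feature extraction, the toy corpus, and the use of scikit-learn are assumptions for illustration only, not the authors' actual pipeline; syllable counts and frequency-based information content would be added as further features in the same way.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def text_features(text):
    """Mean and variance of word length and sentence length (toy version)."""
    sentences = [s for s in text.split(".") if s.strip()]
    word_lens = np.array([len(w) for w in text.split()])
    sent_lens = np.array([len(s.split()) for s in sentences])
    return [word_lens.mean(), word_lens.var(), sent_lens.mean(), sent_lens.var()]

# Toy (text, grade level) pairs; a real corpus would contain many graded texts.
corpus = [
    ("Той видя котка. Котката спеше.", 1),
    ("Учениците решаваха сложни задачи по математика и физика.", 5),
    ("Анализът на историческите извори изисква критично мислене.", 7),
]
X = np.array([text_features(t) for t, _ in corpus])
y = np.array([g for _, g in corpus])

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(linear.predict(X), poly.predict(X))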
KGEIR: Knowledge Graph-Enhanced Iterative Reasoning for Multi-Hop Question Answering
Tianda Sun | Dimitar Kazakov
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Multi-hop question answering (MHQA) requires systems to retrieve and connect information across multiple documents, a task where large language models often struggle. We introduce Knowledge Graph-Enhanced Iterative Reasoning (KGEIR), a framework that dynamically constructs and refines knowledge graphs during question answering to enhance multi-hop reasoning. KGEIR identifies key entities from questions, builds an initial graph from retrieved paragraphs, reasons over this structure, identifies information gaps, and iteratively retrieves additional context to refine the graph until sufficient information is gathered. Evaluations on the HotpotQA, 2WikiMultiHopQA, and MuSiQue benchmarks show performance competitive with or superior to state-of-the-art methods. Ablation studies confirm that structured knowledge representations significantly outperform traditional prompting approaches like Chain-of-Thought and Tree-of-Thought. KGEIR’s ability to explicitly model entity relationships while addressing information gaps through targeted retrieval offers a promising direction for integrating symbolic and neural approaches to complex reasoning tasks.
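A minimal sketch of the iterative loop described in the abstract is given below. The helper callables (retrieve, extract_entities, extract_relations, find_gaps, answer_from_graph) are placeholder stubs standing in for LLM calls and a retriever; they are assumptions, not KGEIR's actual components.

import networkx as nx

def kgeir_answer(question, retrieve, extract_entities, extract_relations,
                 find_gaps, answer_from_graph, max_iters=3):
    """Iteratively grow a knowledge graph until it can support an answer."""
    graph = nx.DiGraph()
    entities = extract_entities(question)      # seed entities from the question
    paragraphs = retrieve(question)            # initial paragraph retrieval
    for _ in range(max_iters):
        # Add (subject, relation, object) triples extracted from the context.
        for subj, rel, obj in extract_relations(paragraphs, entities):
            graph.add_edge(subj, obj, relation=rel)
        gaps = find_gaps(graph, question)      # entities or links still missing
        if not gaps:
            break                              # enough information gathered
        paragraphs = retrieve(" ".join(gaps))  # targeted retrieval for the gaps
    return answer_from_graph(graph, question)

Passing the stubs as arguments keeps the sketch agnostic about which language model or retriever backs each step.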
Mitigating Bias in Text Classification via Prompt-Based Text Transformation
Charmaine Barker | Dimitar Kazakov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
The presence of specific linguistic signals particular to a certain sub-group can become highly salient to language models during training. In automated decision-making settings, this may lead to biased outcomes when models rely on cues that correlate with protected characteristics. We investigate whether prompting ChatGPT to rewrite text using simplification, neutralisation, localisation, and formalisation can reduce demographic signals while preserving meaning. Experimental results show a statistically significant drop in location classification accuracy across multiple models after transformation, suggesting reduced reliance on group-specific language. At the same time, sentiment analysis and rating prediction tasks confirm that the core meaning of the reviews remains largely intact. These results suggest that prompt-based rewriting offers a practical and generalisable approach for mitigating bias in text classification.
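The transformation step could look roughly like the sketch below, where call_llm is a placeholder for whatever chat-completion client is used and the prompt wording is illustrative rather than the paper's actual prompts.

# The four rewriting strategies named in the abstract, with illustrative prompts.
TRANSFORMATIONS = {
    "simplification": "Rewrite the review in simple, plain language.",
    "neutralisation": "Rewrite the review in a neutral style, removing group-specific expressions.",
    "localisation": "Rewrite the review so it could have been written anywhere, removing local references.",
    "formalisation": "Rewrite the review in formal standard English.",
}

def transform(review: str, mode: str, call_llm) -> str:
    """Apply one prompt-based transformation; call_llm is a placeholder client."""
    prompt = (f"{TRANSFORMATIONS[mode]} Preserve the original meaning and sentiment.\n\n"
              f"Review: {review}")
    return call_llm(prompt)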
2024
Meta-Evaluation of Sentence Simplification Metrics
Noof Abdullah Alfear | Dimitar Kazakov | Hend Al-Khalifa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Automatic Text Simplification (ATS) is one of the major Natural Language Processing (NLP) tasks, which aims to help people understand text that is above their reading abilities and comprehension. ATS models reconstruct the text into a simpler format by deletion, substitution, addition or splitting, while preserving the original meaning and maintaining correct grammar. Simplified sentences are usually evaluated by human experts based on three main factors: simplicity, adequacy and fluency, or by calculating automatic evaluation metrics. In this paper, we conduct a meta-evaluation of reference-based automatic metrics for English sentence simplification using a high-quality, human-annotated dataset, NEWSELA-LIKERT. We study the behavior of several evaluation metrics at the sentence level across four different sentence simplification models. All the models were trained on the NEWSELA-AUTO dataset. The correlation between the metrics’ scores and human judgements was analyzed and the results were used to recommend the most appropriate metrics for this task.
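The correlation analysis could be reproduced along the lines of the following sketch; the metric names and all numbers are invented placeholders, not results from the paper.

from scipy.stats import pearsonr, spearmanr

# Placeholder sentence-level scores; real values would come from the
# NEWSELA-LIKERT annotations and the metrics under study.
human_ratings = [4.5, 3.0, 2.5, 4.0, 1.5]
metric_scores = {
    "SARI": [38.2, 30.1, 27.5, 35.0, 22.3],
    "BLEU": [0.42, 0.35, 0.31, 0.40, 0.20],
}

for name, scores in metric_scores.items():
    r, _ = pearsonr(scores, human_ratings)
    rho, _ = spearmanr(scores, human_ratings)
    print(f"{name}: Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")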
2017
Machine Learning Models of Universal Grammar Parameter Dependencies
Dimitar Kazakov | Guido Cordoni | Andrea Ceolin | Monica-Alexandrina Irimia | Shin-Sook Kim | Dimitris Michelioudakis | Nina Radkevich | Cristina Guardiano | Giuseppe Longobardi
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017
The use of parameters in the description of natural language syntax has to balance between the need to discriminate among (sometimes subtly different) languages, which can be seen as a cross-linguistic version of Chomsky’s (1964) descriptive adequacy, and the complexity of the acquisition task that a large number of parameters would imply, which is a problem for explanatory adequacy. Here we present a novel approach in which a machine learning algorithm is used to find dependencies in a table of parameters. The result is a dependency graph in which some of the parameters can be fully predicted from others. These empirical findings can then be subjected to linguistic analysis, which may either refute them by providing typological counter-examples of languages not included in the original dataset, dismiss them on theoretical grounds, or uphold them as tentative empirical laws worthy of further study.
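One simple way to search for such dependencies, sketched below under loose assumptions, is to test whether each parameter can be predicted from the remaining ones and to record a dependency whenever prediction succeeds. The toy parameter table and the decision-tree learner are illustrative choices, not the algorithm used by the authors.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy table: rows are languages, columns are binary parameter settings.
names = ["P1", "P2", "P3", "P4"]
params = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
])

dependencies = []
for j, target in enumerate(names):
    X = np.delete(params, j, axis=1)           # all other parameters
    y = params[:, j]                           # the parameter to predict
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=2).mean()
    if acc == 1.0:                             # fully predictable from the rest
        dependencies.append((target, [n for k, n in enumerate(names) if k != j]))
print(dependencies)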
Building Dialectal Arabic Corpora
Hani Elgabou | Dimitar Kazakov
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology
The aim of this research is to identify local Arabic dialects in texts from social media (Twitter) and link them to specific geographic areas. Dialect identification is studied as a subset of the task of language identification. The proposed method is based on unsupervised learning that uses lexical and geographic distance simultaneously. While this study focusses on Libyan dialects, the approach is general and could produce resources to support human translators and interpreters when dealing with vernaculars rather than standard Arabic.
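A rough sketch of combining lexical and geographic distance for clustering might look as follows; the example tweets, coordinates, distance measures, and weighting are placeholders rather than the study's actual setup.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

tweets = ["شن حالك اليوم", "كيفك شو الأخبار", "واش راك صاحبي"]   # toy examples
coords = np.array([[32.9, 13.2], [33.5, 36.3], [36.8, 3.1]])     # (lat, lon)

# Lexical distance from character n-gram TF-IDF vectors.
lexical = cosine_distances(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(tweets))
# Geographic distance, normalised to [0, 1].
geographic = euclidean_distances(coords)
geographic /= geographic.max()

combined = 0.7 * lexical + 0.3 * geographic    # arbitrary weighting for illustration
labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(combined)
print(labels)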
2013
Using Parallel Corpora for Word Sense Disambiguation
Dimitar Kazakov | Ahmad R. Shahid
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
2009
Unsupervised Construction of a Multilingual WordNet from Parallel Corpora
Dimitar Kazakov | Ahmad R. Shahid
Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning
2004
WordNet-based text document clustering
Julian Sedding | Dimitar Kazakov
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)