Petya Osenova

2025

Word Sense Disambiguation with Large Language Models: Casing Bulgarian
Nikolay Paev | Kiril Simov | Petya Osenova
Proceedings of the 13th Global Wordnet Conference

pdf bib abs

Bulgarian Event Extraction with LLMs
Kiril Simov | Nikolay Paev | Petya Osenova | Stefan Marinov
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The paper presents the results from the experiments with two large language models (LLMs) - T5 and Llama – for extracting events from a Bulgarian event corpus. The two models were pretrained by us on 35 Billion Token Bulgarian Corpus. The extraction was performed within the context of one sentence. Our approach aims at balancing the ACE-oriented approach that uses triggers in event detection, and the MUC-oriented one that uses more general event types. The evaluation relies on the IoU (Intersection over Union) of token spans and is twofold. The first one refers to the predicted event token span. Here if the span is correct, the semantic roles within the event are further checked. The second one refers to the triple of an event type, its semantic roles and participants. The results are promising. A qualitative evaluation is provided as well.

2024

pdf bib abs

Bulgarian ParlaMint 4.0 corpus as a testset for Part-of-speech tagging and Named Entity Recognition
Petya Osenova | Kiril Simov
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

The paper discusses some fine-tuned models for the tasks of part-of-speech tagging and named entity recognition. The fine-tuning was performed on the basis of an existing BERT pre-trained model and two newly pre-trained BERT models for Bulgarian that are cross-tested on the domain of the Bulgarian part of the ParlaMint corpora as a new domain. In addition, a comparison has been made between the performance of the new fine-tuned BERT models and the available results from the Stanza-based model which the Bulgarian part of the ParlaMint corpora has been annotated with. The observations show the weaknesses in each model as well as the common challenges.

pdf bib abs

On a Hurtlex Resource for Bulgarian
Petya Osenova
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)

The paper reports on the cleaning of the Hurtlex lexicon for Bulgarian as part of the multilingual Hurtlex resource. All the challenges during the cleaning process are presented, such as: deleting strings or lexica that are clear errors from the automatic translation, establishing criteria for keeping or discarding a lexeme based on its meaning and potential usages, contextualizing the lexeme with the meaning through an example, etc. In addition, the paper discusses the mapping of the offensive lexica to the BTB-Wordnet as well as the system that has been used.

pdf bib abs

Introducing Shallow Syntactic Information within the Graph-based Dependency Parsing
Nikolay Paev | Kiril Simov | Petya Osenova
Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)

The paper presents a new BERT model, fine-tuned for parsing of Bulgarian texts. This model is extended with a new neural network layer in order to incorporate shallow syntactic information during the training phase. The results show statistically significant improvement over the baseline. Thus, the addition of syntactic knowledge - even partial - makes the model better. Also, some error analysis has been conducted on the results from the parsers. Although the architecture has been designed and tested for Bulgarian, it is also scalable for other languages. This scalability was shown here with some experiments and evaluation on an English treebank with a comparable size.

pdf bib abs

We present ongoing work towards defining a lexicon-corpus interface to serve as a benchmark in the representation of multiword expressions (of various parts of speech) in dedicated lexica and the linking of these entries to their corpus occurrences. The final aim is the harnessing of such resources for the automatic identification of multiword expressions in a text. The involvement of several natural languages aims at the universality of a solution not centered on a particular language, and also accommodating idiosyncrasies. Challenges in the lexicographic description of multiword expressions are discussed, the current status of lexica dedicated to this linguistic phenomenon is outlined, as well as the solution we envisage for creating an ecosystem of interlinked lexica and corpora containing and, respectively, annotated with multiword expressions.

2023

pdf bib abs

We present bgGLUE (Bulgarian General Language Understanding Evaluation), a benchmark for evaluating language models on Natural Language Understanding (NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety of NLP problems (e.g., natural language inference, fact-checking, named entity recognition, sentiment analysis, question answering, etc.) and machine learning tasks (sequence labeling, document-level classification, and regression). We run the first systematic evaluation of pre-trained language models for Bulgarian, comparing and contrasting results across the nine tasks in the benchmark. The evaluation results show strong performance on sequence labeling tasks, but there is a lot of room for improvement for tasks that require more complex reasoning. We make bgGLUE publicly available together with the fine-tuning and the evaluation code, as well as a public leaderboard at https://bgglue.github.io, and we hope that it will enable further advancements in developing NLU models for Bulgarian.

pdf bib abs

Recent Developments in BTB-WordNet
Kiril Simov | Petya Osenova
Proceedings of the 12th Global Wordnet Conference

The paper reports on recent developments in Bulgarian BTB-WordNet (BTB-WN). This resource is viewed as playing a central role with respect to the integration and interlinking of various language resources such as: e-dictionaries (morphological, terminological, bilingual, orthographic, etymological and explanatory, etc., including editions from previous periods); corpora (coming from outside or being internal - like the corpus of definitions as well as the corpus of examples to synset meanings); ontologies (such as CIDOC-CRM, DBpedia, etc.); sources of world knowledge (such as information from the Bulgarian Encyclopedia, Wikipedia, etc.). The paper also gives information about a number of applications built on BTB-WN. These are: the Bulgaria-centered knowledge graph, the All about word application as well as some education-oriented exercises.

pdf bib abs

Transformer-Based Language Models for Bulgarian
Iva Marinova | Kiril Simov | Petya Osenova
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This paper presents an approach for training lightweight and robust language models for Bulgarian that mitigate gender, political, racial, and other biases in the data. Our method involves scraping content from major Bulgarian online media providers using a specialized procedure for source filtering, topic selection, and lexicon-based removal of inappropriate language during the pre-training phase. We continuously improve the models by incorporating new data from various domains, including social media, books, scientific literature, and linguistically modified corpora. Our motivation is to provide a solution that is sufficient for all natural language processing tasks in Bulgarian, and to address the lack of existing procedures for guaranteeing the robustness of such models.

2022

pdf bib abs

The Bulgarian Event Corpus: Overview and Initial NER Experiments
Petya Osenova | Kiril Simov | Iva Marinova | Melania Berbatova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The paper describes the Bulgarian Event Corpus (BEC). The annotation scheme is based on CIDOC-CRM ontology and on the English Framenet, adjusted for our task. It includes two main layers: named entities and events with their roles. The corpus is multi-domain and mainly oriented towards Social Sciences and Humanities (SSH). It will be used for: extracting knowledge and making it available through the Bulgaria-centric Knowledge Graph; further developing an annotation scheme that handles multiple domains in SSH; training automatic modules for the most important knowledge-based tasks, such as domain-specific and nested NER, NEL, event detection and profiling. Initial experiments were conducted on standard NER task due to complexity of the dataset and the rich NE annotation scheme. The results are promising with respect to some labels and give insights on handling better other ones. These experiments serve also as error detection modules that would help us in scheme re-design. They are a basis for further and more complex tasks, such as nested NER, NEL and event detection.

pdf bib abs

In ParlaMint I, a CLARIN-ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments were developed and released in 2021. For 2022 and 2023, the project has been extended to ParlaMint II, again with the CLARIN ERIC financial support, in order to enhance the existing corpora with new data and metadata; upgrade the XML schema; add corpora for 10 new parliaments; provide more application scenarios and carry out additional experiments. The paper reports on these planned steps, including some that have already been taken, and outlines future plans.

pdf bib abs

Raising and Control Constructions in a Bulgarian UD Parsebank of Parliament Sessions
Petya Osenova
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)

The paper discusses the raising and control syntactic structures (marked as ‘xcomp’) in a UD parsed corpus of Bulgarian Parliamentary Sessions. The idea is: to investigate the linguistic status of this phenomenon in an automatically parsed corpus, with a focus on verbal constructions of a head and its dependant together with the shared subject; to detect the errors and get insights on how to improve the annotation scheme and the automatic detection of this phenomenon realizations in Bulgarian.

Petya Osenova

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2004

2003

2002

Co-authors

Venues