Journal Computational Linguistics in Bulgaria
Svetla Koeva (Editor)
- Anthology ID: 2025.jclib-1
- Month: July
- Year: 2025
- Address: Sofia, Bulgaria
- Venue: JCLIB
- Publisher: Institute for Bulgarian Language, Department of Computational Linguistics, Bulgarian Academy of Sciences
- URL: https://aclanthology.org/2025.jclib-1/
- PDF: https://aclanthology.org/2025.jclib-1.pdf
The article introduces the journal Computational Linguistics in Bulgaria, an annual open-access, peer-reviewed journal published by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences. The relationship between the terms computational linguistics, natural language processing and artificial intelligence is briefly commented on in order to clarify the concept behind the journal’s name. The focus is then placed on the Bulgarian language and the Bulgarian research community, emphasising the importance of international contributions for the development of scientific cooperation and progress. The scope of the journal Computational Linguistics in Bulgaria is presented: it publishes articles on all areas of theoretical computational linguistics as well as on existing language resources, datasets and technologies for natural language processing and artificial intelligence. The journal promotes new approaches and methods, especially those aimed at applying language technologies to small and still resource-poor languages such as Bulgarian.
Does ChatGPT Adapt Itself to the Language Used and the Audience It Implies?
Iglika Nikolova Stoupak | Gaël Lejeune | Eva Schaeffer-Lacroix
This paper seeks to quantify and analyse the progress that ChatGPT has made from its GPT-3.5 (2022) to its GPT-4.5 (2025) version when it comes to answering prompts in a selection of differently resourced languages: English, Bulgarian, Greek, French, Hebrew, Japanese and Russian. Factual correctness, textual quality and an answer’s linguistic and cultural independence from an English baseline are evaluated in the process. Each response is marked positively or negatively for each of the three metrics based on a set of defined criteria and careful human-based analysis. In addition, three categories of questions are experimented with: general (e.g. communication assistance or requests for jokes), perception-related (e.g. creative writing or explanation of physical processes) and geography-/culture-sensitive (questions in a specific language that address a particular, slightly sensitive topic related to the implied audience, e.g. ’Why do French people eat snails?’). As hypothesised, the recent GPT-4.5 version demonstrates significant progress in all evaluated categories, thereby resolving past issues such as decreased textual quality in low-resource languages and, notably, very limited variety in answers to the same question across languages. The metric ’Independence from the (English) Baseline’ receives 80.95% positive marks in the GPT-4.5 version, as opposed to 26.19% for GPT-3.5. Lingering problems include ChatGPT’s incomplete ability to generate relevant and culturally sensitive jokes and poems.
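The evaluation scheme described above reduces to a share-of-positive-marks computation per model and metric. A minimal sketch, assuming 42 scored answers per version (an assumption chosen only to reproduce the reported percentages; the abstract itself gives just the percentages):

```python
# Share of positive marks for the 'Independence from the (English) Baseline'
# metric. The raw counts below are ASSUMPTIONS consistent with the reported
# figures, not numbers taken from the paper.
positives = {"GPT-4.5": 34, "GPT-3.5": 11}  # assumed counts of positive marks
total = 42                                   # assumed number of scored answers

for model, pos in positives.items():
    share = round(100 * pos / total, 2)
    print(f"{model}: {share}% positive marks")
# GPT-4.5: 80.95% positive marks
# GPT-3.5: 26.19% positive marks
```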
Light Verb Constructions in ELEXIS-WSD – Annotation, Comparisons and Issues
Cvetana Krstev | Ranka Stanković | Aleksandra Marković
This paper deals with light verb constructions (LVCs) and their annotation in ELEXIS-sr, the Serbian extension of the ELEXIS-WSD corpus. Section 1 gives general introductory remarks about these constructions, the notion of light verbs, and their treatment and further classification in the PARSEME annotation guidelines (subtypes LVC.full and LVC.cause). Section 2 offers an insight into the ELEXIS-WSD corpus, annotated with VMWEs for several languages, with the remark that these VMWEs were not further subcategorised into finer classes; for this paper, we classified them ourselves to facilitate comparisons with the LVCs annotated in ELEXIS-sr. Section 3 presents the tools and resources used for the automatic annotation of ELEXIS-sr, as well as the results of manual checking. In Section 4, we compare LVCs in four ELEXIS-WSD sub-collections: Serbian, Bulgarian, Slovene, and English. We use Serbian as the starting point for this comparison, as it has been thoroughly annotated with MWEs (and NEs). We present the results of comparing all occurrences of LVCs in the Serbian extension with their occurrences and annotation in both the ELEXIS-WSD and PARSEME sub-corpora for the other languages. An important conclusion is that the greatest number of LVC equivalents is found between Serbian and Bulgarian, closely related Slavic languages (a total of 34 equivalents), while between Serbian and Slovene, also a Slavic language, there are 11 equivalents, the same as between Serbian and English. This could be explained by the number of VMWEs and LVCs annotated, or by the strategies used by different annotators.
Automatic Detection of the Bulgarian Evidential Renarrative
Irina Temnikova | Ruslana Margova | Stefan Minkov | Tsvetelina Stefanova | Nevena Grigorova | Silvia Gargova | Venelin Kovatchev
Manual and automatic verification of the trustworthiness of information is an important task. Knowing whether the author of a statement was an eyewitness to the reported event(s) is a useful clue. In linguistics, such information is expressed through “evidentiality”. Evidentials are especially important in Bulgarian, as Bulgarian journalists often use a specific type of evidential (the “renarrative”) to report events that they neither directly observed nor verified. Unfortunately, there have been no automatic tools to detect the Bulgarian renarrative. This article presents the first two automatic solutions for this task: a fine-tuned BERT classifier (the renarrative BERT detector, BGRenBERT), achieving 0.98 accuracy on the test split, and a rule-based renarrative detector (BGRenRules), created with regular expressions matching a parser’s output. Both solutions detect Bulgarian texts containing the most frequently encountered forms of the renarrative. Additionally, we compare the results of the two detectors with the manual annotation of subsets of two Bulgarian fake-text datasets. BGRenRules obtains substantially higher results than BGRenBERT. The error analysis shows that the errors of BGRenRules most frequently correspond to cases in which humans also have doubts. The training dataset (BgRenData), the annotated dataset subsets, and the two detectors are made publicly accessible on Zenodo, GitHub, and HuggingFace. We expect these new resources to be of invaluable assistance to 1) Bulgarian-language researchers and 2) researchers of other languages with similar phenomena, especially those working on information verification.
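The rule-based idea can be illustrated with a toy sketch. This is not the authors’ BGRenRules (which matches regular expressions against a parser’s output); it only shows the kind of surface cue involved: the most common third-person renarrative forms use an l-participle without the auxiliary “е”/“са”. All patterns here are illustrative assumptions and would over- and under-generate on real text.

```python
import re

# Toy heuristic (NOT the authors' BGRenRules): flag a sentence as a
# renarrative candidate if it contains an l-participle (ending in
# -л/-ла/-ло/-ли) but no 3rd-person auxiliary "е"/"са".
# Known limitation: the question particle "ли" and many ordinary words
# also end in these letters, so a real system needs parser output.
L_PARTICIPLE = re.compile(r"\b\w+(?:л|ла|ло|ли)\b")
AUXILIARY = re.compile(r"\b(?:е|са)\b")

def looks_renarrative(sentence: str) -> bool:
    """Crude surface test: l-participle present, auxiliary absent."""
    return bool(L_PARTICIPLE.search(sentence)) and not AUXILIARY.search(sentence)

# Illustrative contrast (renarrative vs. plain perfect with auxiliary):
print(looks_renarrative("Той казал истината."))    # renarrative reading: True
print(looks_renarrative("Той е казал истината."))  # perfect with "е": False
```

A real detector, like the one described above, would operate on morphosyntactic analyses rather than raw strings, which is what makes the regular-expression rules reliable.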
This study examines the technological gaps in text-processing AI tools available to Bulgarian-language journalists and how these tools might better support journalistic practices. Through a systematic analysis of current technologies across three key domains (monitoring and information gathering, content production, and content dissemination), the research reveals significant disparities between international standards and local capabilities. While some resources exist for Bulgarian journalism, including news aggregators and translation services, these tools often lack transparency, are updated infrequently, or provide insufficient functionality for professional journalistic needs. Large language models (LLMs) offer promising possibilities but remain underutilised in Bulgarian newsrooms. The article provides a case study of the practical use of AI. The study recommends strategic investment in language-specific AI, targeted training, transparency standards, and ethical frameworks to improve journalistic capacity and information quality in Bulgaria, as trustworthy journalism must reach wider audiences to drown out disinformation.