Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
A Cross-model Study on Learning Romanian Parts of Speech with Transformer Models
Radu Ion | Verginica Barbu Mititelu | Vasile Păiş | Elena Irimia | Valentin Badea
This paper attempts to determine experimentally whether POS tagging of unseen words achieves accuracy comparable to that for words that were rarely seen in the training set (i.e. frequency less than 5) or seen more frequently (i.e. frequency greater than 10). To compare accuracies objectively, we use the odds ratio statistic and its confidence interval to show that the odds of being correct on unseen words are close to the odds of being correct on rarely seen words. For the training of the POS taggers, we use different Romanian BERT models that are freely available on HuggingFace.
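A minimal sketch of the odds-ratio comparison this abstract describes, with made-up counts standing in for the paper's results:

```python
# Odds ratio with a 95% Wald confidence interval, comparing tagger accuracy
# on unseen words vs. rarely seen words. The counts below are hypothetical.
import math

def odds_ratio_ci(correct_a, wrong_a, correct_b, wrong_b, z=1.96):
    or_ = (correct_a * wrong_b) / (wrong_a * correct_b)
    se = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)

# Group A: unseen words; group B: words with training frequency < 5.
or_, (lo, hi) = odds_ratio_ci(correct_a=912, wrong_a=88,
                              correct_b=940, wrong_b=60)
print(f"OR = {or_:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# A CI containing 1 suggests the odds of a correct tag are comparable.
```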
What do BERT Word Embeddings Learn about the French Language?
Ekaterina Goliakova | David Langlois
Pre-trained word embeddings (for example, BERT-like) have been successfully used in a variety of downstream tasks. However, do all embeddings obtained from models of the same architecture encode information in the same way? Does the size of the model correlate with the quality of the information encoding? In this paper, we attempt to dissect the dimensions of several BERT-like models trained on the French language to find where grammatical information (gender, plurality, part of speech) and semantic features might be encoded. In addition, we propose a framework for comparing the quality of encoding in different models.
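One way such dimension-level probing can be sketched (placeholder data, not the authors' setup):

```python
# Probing sketch: train a tiny classifier on one embedding dimension at a
# time and rank dimensions by how well they alone predict a grammatical
# feature such as gender. Embeddings and labels are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))    # stand-in embeddings (real BERT: 768+ dims)
y = rng.integers(0, 2, size=500)  # stand-in binary feature labels

scores = []
for d in range(X.shape[1]):
    acc = cross_val_score(LogisticRegression(), X[:, d:d + 1], y, cv=3).mean()
    scores.append((acc, d))

print("Top candidate dimensions:", sorted(scores, reverse=True)[:5])
```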
Whisper–TAD: A General Model for Transcription, Alignment and Diarization of Speech
Camille Lavigne | Alex Stasica
Currently, there is no straightforward implementation of diarization-augmented speech transcription (DAST), i.e. transcription, diarization and alignment to the audio within one model. These tasks typically require distinct models, which must be stacked together for complete processing. In this study, we advocate leveraging the advanced capabilities of the Whisper models, which already excel in automatic transcription and partial alignment. Our approach involves fine-tuning the model’s parameters on both transcription and diarization tasks in a SOT-FIFO (Serialized Output Training-First In First Out) manner. This comprehensive framework facilitates the creation of orthographic transcriptions, identification of speakers, and precise alignment, thus enhancing the efficiency of audio processing workflows. While our work represents an initial step towards a unified transcription and diarization framework, the development of such a model demands substantial high-quality data augmentation and computational resources beyond our current scope. Consequently, our focus is narrowed to the English language. Despite these limitations, our method demonstrates promising performance in both transcription and diarization tasks. Comparative analysis between pre-trained models and fine-tuned TAD (Transcription, Alignment, Diarization) versions suggests that incorporating diarization into a Whisper model does not compromise transcription accuracy. Our findings hint that deploying our TAD framework on the largest Whisper model could potentially yield state-of-the-art performance across all mentioned tasks.
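A rough illustration of what a SOT-FIFO training target might look like; the token name and format are assumptions for illustration, not Whisper's actual vocabulary:

```python
# Utterances are serialized in order of start time (first in, first out),
# with a speaker-change token inserted between speakers.
SPEAKER_CHANGE = "<|speakerchange|>"

def serialize_sot_fifo(utterances):
    """utterances: iterable of (start_time, speaker_id, text)."""
    parts, prev = [], None
    for start, speaker, text in sorted(utterances, key=lambda u: u[0]):
        if prev is not None and speaker != prev:
            parts.append(SPEAKER_CHANGE)
        parts.append(text)
        prev = speaker
    return " ".join(parts)

print(serialize_sot_fifo([
    (0.0, "spk0", "Hello there."),
    (1.7, "spk1", "Hi, how are you?"),
    (3.2, "spk0", "Fine, thanks."),
]))
```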
Contemporary LLMs and Literary Abridgement: An Analytical Inquiry
Iglika Nikolova-Stoupak | Gaël Lejeune | Eva Schaeffer-Lacroix
Within the framework of this study, several contemporary Large Language Models (ChatGPT, Gemini Pro, Mistral-Instruct and BgGPT) are evaluated on their ability to generate abridged versions of literary texts. The analysis is based on ’The Ugly Duckling’ by H. C. Andersen as translated into English, French and Bulgarian. The abridgement scenarios experimented with include zero-shot, one-shot, chunk-based and crosslingual (including chain-of-thought) abridgement. The resulting texts are evaluated both automatically and via human evaluation. The automatic analysis includes ROUGE and BERTScore as well as the ratios of a selection of readability-related textual features (e.g. number of words, type-to-token ratio) in the original versus the automatically abridged texts. Professionally composed abridged versions are regarded as the gold standard. Following the automatic analysis, the six best candidate texts per language are evaluated by volunteers with university education in terms of more qualitative textual characteristics, such as coherence, consistency and aesthetic appeal.
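The ratio-style readability features mentioned here can be sketched as follows; the two sentences are invented examples:

```python
# Word-count and type-to-token ratios of an abridged text relative to the
# original, as simple stand-ins for the paper's readability features.
import re

def words(text):
    return re.findall(r"\w+", text.lower())

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

original = "The duckling was so large and ugly that the other birds mocked him."
abridged = "The ugly duckling was mocked."
w_orig, w_abr = words(original), words(abridged)
print("word-count ratio:", len(w_abr) / len(w_orig))
print("TTR ratio:", ttr(w_abr) / ttr(w_orig))
```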
Advancing Sentiment Analysis in Serbian Literature: A Zero and Few–Shot Learning Approach Using the Mistral Model
Milica Ikonić Nešić | Saša Petalinkar | Mihailo Škorić | Ranka Stanković | Biljana Rujević
This study presents sentiment analysis of old Serbian novels from the 1840–1920 period, employing the Mistral Large Language Model (LLM) to pioneer zero- and few-shot learning techniques. The approach innovates by devising research prompts that include guidance text for zero-shot classification and examples for few-shot learning, enabling the LLM to classify sentiments into positive, negative, or objective categories. This methodology aims to streamline sentiment analysis by constraining responses, thereby enhancing classification precision. Python, along with the Hugging Face Transformers and LangChain libraries, serves as our technological backbone, facilitating the creation and refinement of research prompts tailored for sentence-level sentiment analysis. The results in both scenarios, zero-shot and few-shot, indicate that the zero-shot approach performs better, achieving an accuracy of 68.2%.
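A hedged sketch of the prompt construction described here; the wording, label set and example sentence are illustrative, not the authors' exact prompts:

```python
# Builds a zero-shot prompt (no examples) or a few-shot prompt (with
# labelled examples) for single-label sentiment classification.
LABELS = ("positive", "negative", "objective")

def build_prompt(sentence, examples=()):
    lines = ["Classify the sentiment of the sentence as one of: "
             + ", ".join(LABELS) + ". Answer with a single label only."]
    for ex_sentence, ex_label in examples:   # empty tuple => zero-shot
        lines.append(f"Sentence: {ex_sentence}\nLabel: {ex_label}")
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(lines)

# Zero-shot prompt for a made-up Serbian sentence:
print(build_prompt("Vetar je tiho šaputao kroz staro drveće."))
```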
Generating Phonetic Embeddings for Bulgarian Words with Neural Networks
Lyuboslav Karev | Ivan Koychev
Word embeddings can be considered the cornerstone of modern natural language processing. They are used in many NLP tasks and allow us to create models that can understand the meaning of words. Most word embeddings model the semantics of words. In this paper, we create phoneme-based word embeddings, which model how a word sounds. This is accomplished by training a neural network that can automatically generate transcriptions of Bulgarian words. We used the Jaccard index and direct comparison metrics to measure the performance of the neural networks. The models perform nearly perfectly on the task of generating transcriptions. The word embeddings offer versatility across various applications, with automatic paronym detection being particularly notable, as well as detecting the language of origin of a Bulgarian word. The performance of paronym detection is measured with the standard classifier metrics: accuracy, precision, recall, and F1.
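A minimal example of a Jaccard comparison between a predicted and a gold transcription; the choice of character bigrams as the compared units is an assumption:

```python
# Jaccard index over character bigrams of two transcriptions.
def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb)

print(jaccard("kniga", "knigə"))  # 3 shared of 5 bigrams -> 0.6
```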
Universal Dependencies Treebank for Standard Albanian: A New Approach
Nelda Kote | Rozana Rushiti | Anila Çepani | Alba Haveriku | Evis Trandafili | Elinda Kajo Meçe | Elsa Skënderi Rakipllari | Lindita Xhanari | Albana Deda
In this paper, we present a Universal Dependencies (UD) treebank for the Standard Albanian Language (SAL), annotated by expert linguists supported by information technology professionals. The annotated treebank consists of 24,537 tokens (1,400 sentences) and includes annotation for syntactic dependencies, part-of-speech tags, morphological features, and lemmas. This treebank represents the largest UD treebank available for SAL. In order to overcome annotation challenges in SAL within the UD framework, we carefully balanced preserving the richness of SAL grammar with adapting the UD tagset and addressing unique language-specific features for a unified annotation. We discuss the criteria followed to select the sentences included in the treebank and address the most significant linguistic considerations when adapting the UD framework to conform to the grammar of SAL. Our efforts contribute to the advancement of linguistic analyses and Natural Language Processing (NLP) for SAL. The treebank will be made available online under an open license so as to enable further development of NLP tools based on Artificial Intelligence (AI) models for the Albanian language.
Function Multiword Expressions Annotated with Discourse Relations in the Romanian Reference Treebank
Verginica Barbu Mititelu | Tudor Voicu
For the Romanian Reference Treebank, a general language corpus covering several genres and annotated according to the principles of Universal Dependencies, we present here the annotation of some function words, namely multiword conjunctions, with discourse relations from the inventory of the Penn Discourse Treebank version 3.0. The annotation process was manual, with two annotators for each occurrence of the conjunctions. Lexical-semantic relations such as synonymy and polysemy can be established between the senses of such conjunctions. The discourse relations are added to the CoNLL-U file in which the treebank is represented.
Dependency Parser for Bulgarian
Atanas Atanasov
This paper delves into the implementation of a Biaffine Attention Model, a sophisticated neural network architecture employed for dependency parsing. Proposed by Dozat and Manning, this model is applied here to Bulgarian language processing. The model’s training and evaluation are conducted on the Bulgarian Universal Dependencies dataset. The paper offers a comprehensive explanation of the model’s architecture and the data preparation process, aiming to demonstrate that for highly inflected languages the inclusion of two additional input layers, lemmas and language-specific morphological information, is beneficial. The results of the experiments are subsequently presented and discussed. The paper concludes with a reflection on the model’s performance and suggestions for potential future work.
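A compact sketch of the biaffine arc scorer at the heart of such a parser, in the spirit of Dozat and Manning; dimensions are illustrative and this is not the paper's full implementation:

```python
# Biaffine arc scorer: scores every head-dependent pair from head and
# dependent token representations.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # The extra row lets the bilinear form include a head-bias term.
        self.U = nn.Parameter(torch.randn(dim + 1, dim))

    def forward(self, h_dep, h_head):
        # h_dep, h_head: (batch, seq_len, dim) token representations.
        ones = torch.ones_like(h_dep[..., :1])
        h_dep = torch.cat([h_dep, ones], dim=-1)
        # scores[b, i, j]: score of token j being the head of token i.
        return h_dep @ self.U @ h_head.transpose(1, 2)

scorer = BiaffineArcScorer(dim=8)
print(scorer(torch.randn(2, 5, 8), torch.randn(2, 5, 8)).shape)  # (2, 5, 5)
```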
Towards a Romanian Phrasal Academic Lexicon
Madalina Chitez | Ana-Maria Bucur | Andreea Dinca | Roxana Rogobete
The lack of NLP-based research on academic writing in Romania results in an unbalanced development of automatic support tools for Romanian compared to other languages, such as English. For this study, we use the Romanian subsets of two bilingual academic writing corpora: the ROGER corpus, consisting of university student papers, and the EXPRES corpus, composed of expert research articles. Working with the Romanian Academic Word List (Ro-AWL), we present two phrase extraction phases: (i) using Ro-AWL words as node words to extract collocations according to the thresholds of statistical measures and (ii) classifying the extracted phrases into general versus domain-specific multi-word units. We show how manual rhetorical function annotation of the resulting phrases can be combined with automatic function detection. The comparison between academic phrases in ROGER and EXPRES validates the final phrase list. The Romanian Phrasal Academic Lexicon (ROPAL), similar to the Oxford Phrasal Academic Lexicon (OPAL), is a written academic phrase lexicon for the Romanian language, made available for academic use and further research or applications.
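Threshold-based collocation extraction around node words can be sketched like this; the toy corpus, the choice of PMI as the measure, and the cut-off are placeholders:

```python
# Count bigrams, then keep pairs containing the node word whose PMI clears
# a threshold.
import math
from collections import Counter

corpus = ("we present results and we present a new analysis "
          "and we present results").split()
unigrams, bigram_counts = Counter(corpus), Counter(zip(corpus, corpus[1:]))
n = len(corpus)

def pmi(w1, w2):
    p_xy = bigram_counts[(w1, w2)] / (n - 1)
    return math.log2(p_xy / ((unigrams[w1] / n) * (unigrams[w2] / n)))

node = "present"
collocations = [(w1, w2, round(pmi(w1, w2), 2))
                for (w1, w2) in bigram_counts
                if node in (w1, w2) and pmi(w1, w2) > 1.0]
print(collocations)
```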
Classifying Multi–Word Expressions in the Latvian Monolingual Electronic Dictionary Tēzaurs.lv
Laura Rituma | Gunta Nešpore-Bērzkalne | Agute Klints | Ilze Lokmane | Madara Stāde | Pēteris Paikens
The electronic dictionary Tēzaurs.lv contains more than 400,000 entries, of which 73,000 are multi-word expressions (MWEs). Over the past two years, there has been an ongoing division of these MWEs into subgroups (proper names, multi-word terms, taxa, phraseological units, collocations). The article describes the classification of MWEs, focusing on phraseological units (approximately 7,250 entries), as well as on borderline cases between phraseological unit types (phrasemes and idioms) and between the different MWE groups in general. The division of phraseological units depends on semantic divisibility and figurativeness. In a phraseme, at least one of the constituents retains its literal sense, whereas the meaning of an idiom does not depend on the literal sense of any of its constituents. As a result, 65,919 MWE entries have been manually classified, and this MWE type information is now available to the users of the electronic dictionary Tēzaurs.lv.
Complex Word Identification for Italian Language: A Dictionary–based Approach
Laura Occhipinti
Assessing word complexity in Italian poses significant challenges, particularly due to the absence of a standardized dataset. This study introduces the first automatic model designed to identify word complexity for native Italian speakers. A dictionary of simple and complex words was constructed, and various configurations of linguistic features were explored to find the best statistical classifier based on the Random Forest algorithm. Considering the probability of a word belonging to a class, the models’ predictions were compared with human assessments derived from a dataset annotated for complexity perception. Finally, the agreement between the model predictions and the human inter-annotator agreement was analyzed using Spearman correlation. Our findings indicate that a model incorporating both linguistic features and word embeddings performed better than simpler models, also showing a correlation with human judgements similar to the inter-annotator agreement. This study demonstrates the feasibility of an automatic system for detecting complexity in Italian, with good performance and effectiveness comparable to humans on this subjective task.
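A minimal sketch of this pipeline with random placeholder data: a Random Forest yields class probabilities, which are then correlated with human ratings via Spearman:

```python
# Random Forest complexity classifier + Spearman correlation with humans.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))    # stand-in linguistic feature vectors
y = rng.integers(0, 2, size=300)  # 0 = simple, 1 = complex
human = rng.random(300)           # stand-in human complexity ratings

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
p_complex = clf.predict_proba(X)[:, 1]  # probability of the "complex" class
rho, pval = spearmanr(p_complex, human)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2g})")
```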
Verbal Multiword Expressions in the Croatian Verb Lexicon
Ivana Brač | Matea Birtić
The paper examines the complexities of encoding verbal multiword expressions in the Croatian verb lexicon. The lexicon incorporates a verb’s description at the syntactic, morphological, and semantic levels. This study explores the treatment of reflexive verbs, light verb constructions, and verbal idioms across several Croatian and Slavic language resources to find the best solution for the verb lexicon. It addresses the following research questions: 1. How should reflexive verbs, i.e., verbs with the reflexive marker se, be treated? Should they be considered separate lemmas, sublemmas of their non-reflexive counterparts, or one of their senses? 2. What syntactic label and semantic role should be assigned to the predicative noun in light verb constructions? 3. Should verbal idioms be included and, if so, at which level of description? Our conclusion is that all reflexive verbs should be treated as separate lemmas, since they are distinct lexemes that have undergone semantic and syntactic change. To differentiate between a semantically full verb and a light verb, we have introduced the label LV and decided not to assign a semantic role to the predicative noun. By including verbal idioms and their translations into English, the lexicon can also benefit non-native users. The aim is to enhance the verb lexicon for the more effective description and recognition of verbal multiword expressions.
Assessing Reading Literacy of Bulgarian Pupils with Finger–tracking
Alessandro Lento | Andrea Nadalini | Marcello Ferro | Claudia Marzi | Vito Pirrelli | Tsvetana Dimitrova | Hristina Kukova | Valentina Stefanova | Maria Todorova | Svetla Koeva
The paper reports on the first steps in developing a time-stamped multimodal dataset of reading data by Bulgarian children. Data are being collected, structured and analysed by means of ReadLet, an innovative infrastructure for multimodal language data collection that uses a tablet as a reader’s front-end. The overall goal of the project is to quantitatively analyse the reading skills of a sample of early Bulgarian readers collected over a two-year period, and compare them with the reading data of early readers of Italian, collected using the same protocol. We illustrate design issues of the experimental protocol, as well as the data acquisition process and the post-processing phase of data annotation/augmentation. To evaluate the potential and usefulness of the Bulgarian dataset for reading research, we present some preliminary statistical analyses of our recently collected data. They show robust convergence trends between Bulgarian and Italian early reading development stages.
Educational Horizons: Mapping the Terrain of Artificial Intelligence Integration in Bulgarian Educational Settings
Denitza Kurshumova
The role of artificial intelligence in education (AIEd) has recently become a major topic of discussion and future planning. This article presents data from a large-scale survey involving 1,463 Bulgarian educators in primary, secondary, and high schools. The results revealed that 70.30% of the teachers were familiar or somewhat familiar with the existence of AI applications. Chatbots were the most popular among the surveyed teachers, with ChatGPT ranking as the most familiar. The teachers were almost equally split between those who reported use and those who declared non-use of AI technology for instructional purposes. A significant association was found between the teachers’ familiarity with and use of AI technology and their age-related generational traits. The younger educators (up to 40 years of age) were associated with higher use of AI technology as a support tool for creating lesson plans, lesson content, tests, and exams. The outlined tendencies can be used to inform policy, professional development, and future research in the realm of AI-driven education.
Evidential Auxiliaries as Non–reliability Markers in Bulgarian Parliamentary Speech
Ekaterina Tarpomanova
In the evidentiality system of Bulgarian, there are three evidential auxiliaries that form complex verbal forms. The paper analyzes their potential to mark non-reliability in political discourse using the ParlaMint-BG corpus of parliamentary debates. The method of the study includes detection, categorisation and context analysis of the evidentials formed with auxiliaries. The results show that the evidential auxiliaries function as markers of non-reliability, especially in argumentative text types such as political discourse.
Extended Context at the Introduction of Complex Vocabulary in Abridged Literary Texts
Iglika Nikolova-Stoupak | Eva Schaeffer-Lacroix | Gaël Lejeune
Psycholinguistics speaks of a fine-tuning process used by parents as they address children, in which complex vocabulary is introduced with additional context (Leung et al., 2021). This somewhat counterintuitive lengthening of text in order to aid one’s interlocutor in language acquisition also accords with Harris’s (1988) notion that for every complex sentence there is an equivalent longer (non-contracted) yet simpler one containing the same amount of information. Within the proposed work, a corpus of eight renowned literary works (e.g. Alice’s Adventures in Wonderland, The Adventures of Tom Sawyer, Les Misérables) in four distinct languages (English, French, Russian and Spanish) is gathered: both the original (or translated) versions and up to four abridged versions for various audiences (e.g. children of a defined age or foreign language learners of a defined level) are present. The contexts of the first appearance of complex words (as determined by word frequency) in pairs of original and abridged works are compared, and the cases in which the abridged texts offer longer context are investigated. The discovered transformations are classified into three categories: addition of vocabulary items from the same lexical field as the complex word, simplification of grammar, and insertion of a definition. Context extensions are then statistically analysed in association with different languages and reader audiences.
Corpus–based Research into Derivational Morphology: A Comparative Study of Japanese and English Verbalization
Junya Morita
As part of elucidating the syntax-morphology interaction, this study investigates where and how complex verbs are formed in Japanese and English. Focusing on the Japanese verb-forming suffix -ka-suru (e.g. toshi-o gendai-ka-suru ‘modernize city’), relevant verbs are extracted from a large-scale corpus and they receive an in-depth analysis from semantic, morphosyntactic, and functional viewpoints. The properties of -ka-suru and those of its English counterpart are then compared and contrasted. The result reveals three main points: (i) -ka-suru verbs are constantly created in syntactic settings to fulfill the functions of brevity and conceptualization, (ii) while denominal -ize derivatives have several submeanings such as ‘result,’ ‘ornative,’ and ‘agentive,’ -ka-suru equivalents retain the meaning ‘result,’ and (iii) -ka-suru can be combined with compound nouns, but -ize cannot. We will demonstrate that the above features originate in the underlying syntactic structure related to each suffix and their difference, thus supporting the thesis of syntactic word formation.
(1) ji-kokumin-o moomai-ka-suru
one’s-people-ACC ignorant-change-do
‘make one’s people ignorant’
(2) shinikaketa momiji-o bonsai-ka-suru
dying maple-ACC bonsai-change-do
‘turn a dying maple into a bonsai’
The Verbal Category of Conditionality in Bulgarian and its Ukrainian Correspondences
Ivan Derzhanski | Olena Siruk
Modern Bulgarian shares a conditional mood with the other Slavic languages, but it has also developed a future-in-the-past tense which is structurally analogous to the category in many Western European languages traditionally called a conditional mood in their grammars. The distinction between these two forms is sometimes elusive and can be difficult for native speakers of Slavic languages who are learning Bulgarian. In this paper we consider the uses of the Bulgarian conditional mood and future-in-the-past tense in a parallel corpus of Bulgarian and Ukrainian text, examine the corresponding wording in Ukrainian, where the conditional mood is supplemented by modal verbs, and discuss the breadth of choices open to translators working in each direction.
Lexical Richness of French and Quebec Journalistic Texts
Natalia Dankova
This paper presents some results of a quantitative study of lexical variety and word frequency in texts from a comparative perspective. The study analyzes and compares French and Quebec journalistic texts on political and cultural topics written in French and recently published in major newspapers such as Le Monde, Le Figaro and Le Devoir. The statistical analysis concerns the number of different words in a text, the number of different adjectives, the number of different verbs (as well as passive structures, participles and gerunds, which contribute to syntactic and stylistic sophistication), and the number of hapaxes. French texts from France exhibit greater lexical richness and sophistication: they contain more adjectives, a greater variety of adjectives, and more participles and gerunds compared to French texts from Quebec. The originality of the study lies in its analysis of variation in French using a lexicometric approach.
A Corpus of Liturgical Texts in German: Towards Multilevel Text Annotation
Maria Khokhlova | Mikhail Koryshev
The aim of the study is to create a “documented” literary and theological history of German Catholic hymnography. The paper focuses on the creation of a corpus of liturgical texts in German and describes the first stage of annotation, dealing with the metatextual markup of Catholic hymns. The authors describe in detail the parameters of the multi-level classification of hymn texts they developed, which allows hymns to be differentiated on different grounds. The parameters include not only characteristics of whole hymns (the period and source of their origin, rubrics, musical accompaniment) but also ones inherent to strophes. Based on the created markup, it is possible to trace general trends in texts grouped according to certain meta-features. The developed annotation scheme is illustrated with the hymnal Gotteslob (1975). The results present statistics on the different parameters used for hymn description.
EurLexSummarization – A New Text Summarization Dataset on EU Legislation in 24 Languages with GPT Evaluation
Valentin Zmiycharov | Todor Tsonkov | Ivan Koychev
Legal documents are notorious for their length and complexity, making it challenging to extract crucial information efficiently. In this paper, we introduce a new dataset for legal text summarization, covering 24 languages. We not only present and analyze the dataset but also conduct experiments using various extractive techniques. We provide a comparison between these techniques and summaries generated by the state-of-the-art GPT models. The abstractive GPT approach outperforms the extractive TextRank approach in 8 languages, but produces slightly lower results in the remaining 16 languages. This research aims to advance the field of legal document summarization by addressing the need for accessible and comprehensive information retrieval from lengthy legal texts.
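A minimal TextRank-style extractive baseline of the kind compared here; this is a sketch under a simple word-overlap similarity, not the paper's exact configuration:

```python
# Rank sentences with PageRank over a word-overlap similarity graph and
# return the top-scoring ones in document order.
import re
import networkx as nx

def textrank_summary(sentences, top_k=2):
    def tokens(s):
        return set(re.findall(r"\w+", s.lower()))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(tokens(sentences[i]) & tokens(sentences[j]))
            if overlap:
                graph.add_edge(i, j, weight=overlap)
    ranks = nx.pagerank(graph, weight="weight")
    best = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]  # keep document order

doc = ["This Regulation lays down common rules.",
       "The rules apply to all Member States.",
       "Member States shall report annually.",
       "Annual reports follow the common rules."]
print(textrank_summary(doc))
```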
On a Hurtlex Resource for Bulgarian
Petya Osenova
The paper reports on the cleaning of the Bulgarian part of the multilingual Hurtlex lexicon. The challenges of the cleaning process are presented, such as deleting strings or lexical entries that are clear errors of the automatic translation, establishing criteria for keeping or discarding a lexeme based on its meaning and potential usages, and contextualizing a lexeme’s meaning through an example. In addition, the paper discusses the mapping of the offensive lexical entries to the BTB-Wordnet, as well as the system that has been used.
Unified Annotation of the Stages of the Bulgarian Language. First Steps
Fabio Maion | Tsvetana Dimitrova | Andrej Bojadziev
The paper reports on ongoing work on proposed guidelines for the unified annotation of the stages in the development of the Bulgarian language from the Middle Ages to the early modern period. It discusses the criteria for the selection of texts and their representation, along with some results of trial tagging with an existing tagger already trained on other texts.
ChatGPT: Detection of Spanish Terms Based on False Friends
Amal Haddad Haddad | Damith Premasiri
One of the common errors translators commit when transferring terms from one language into another is erroneously coining terms based on a false friend mistake, due to the similarity between the lexical units forming part of the terms. In this case study, we use ChatGPT to automatically detect terms in Spanish which may be coined based on a false friend relation. To carry out this study, we implemented two experiments with GPT and compared the results. In the first, we prompted GPT to produce a list of twenty terms in Spanish extracted from UN discourse which are possibly based on a false friend relation, together with their English equivalents, and analysed the veracity of the results. In the second experiment, we used an aligned corpus to further study the capabilities of the language model in detecting false friends in English and Spanish text. Some results are significant for future terminological studies.
Deep Learning Framework for Identifying Future Market Opportunities from Textual User Reviews
Jordan Kralev
The paper develops an application of design gap theory for identifying future market segment growth and capitalization from a set of customer reviews of products bought on the market in a given past period. To build a consumer feature space, an encoder-decoder network with attention is trained over the textual reviews after they are pre-processed through tokenization and embedding layers. The encodings of product reviews are used to train a variational autoencoder (VAE) network representing a product feature space. The sampling capabilities of this network are extended with a function that looks for innovative designs with high consumer preference, characterizing future opportunities in a given market segment. The framework is demonstrated on Amazon reviews in the consumer electronics segment.
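A compact sketch of the VAE component (illustrative sizes; random tensors stand in for real review encodings):

```python
# The encoder yields a mean and log-variance, the reparameterization trick
# samples a latent product feature, and the decoder reconstructs the input.
import torch
import torch.nn as nn

class ReviewVAE(nn.Module):
    def __init__(self, in_dim=64, z_dim=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 32)
        self.mu = nn.Linear(32, z_dim)
        self.logvar = nn.Linear(32, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(),
                                 nn.Linear(32, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        return self.dec(z), mu, logvar

x = torch.randn(4, 64)                 # stand-in review encodings
recon, mu, logvar = ReviewVAE()(x)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, x) + 1e-3 * kl
print(loss.item())
```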
Look Who’s Talking: The Most Frequently Used Words in the Bulgarian Parliament 1990-2024
Ruslana Margova | Bastiaan Bruinsma
In this study we identify the most frequently used words and some multi-word expressions in the Bulgarian Parliament. We do this by using the transcripts of all plenary sessions between 1990 and 2024 - 3,936 in total. This allows us both to study an interesting period known in the Bulgarian linguistic space as the years of “transition and democracy”, and to provide scholars of Bulgarian politics with a purposefully generated list of additional stop words that they can use for future analysis. Because our list of words was generated from the data, there is no preconceived theory, and because we include all interactions during all sessions, our analysis goes beyond traditional party lines. We provide details of how we selected, retrieved, and cleaned our data, and discuss our findings.
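The frequency tallying behind such a stop-word list can be sketched as follows; the two Bulgarian lines are toy stand-ins for real plenary transcripts:

```python
# Count words across session transcripts and take the most common as
# candidate stop words.
import re
from collections import Counter

transcripts = [
    "Уважаеми колеги, предлагам да гласуваме.",  # "Dear colleagues, ..."
    "Колеги, моля да гласуваме предложението.",  # "Colleagues, please ..."
]

counts = Counter()
for text in transcripts:
    counts.update(re.findall(r"\w+", text.lower()))

print(counts.most_common(5))  # top candidates for the stop-word list
```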
Estimating Commonsense Knowledge from a Linguistic Analysis on Information Distribution
Sabrina Mennella | Maria Di Maro | Martina Di Bratto
Commonsense Knowledge (CSK) is a complex and multifaceted structure, encompassing a wide range of knowledge and reasoning generally acquired through everyday experience. As CSK is often implicit in communication, it poses a challenge for AI systems attempting to simulate human-like interaction. This work aims to deepen the analysis of CSK information structure from a linguistic perspective, starting from its organisation in conversations. To achieve this goal, we developed a three-level analysis model to extract more insights about this knowledge, focusing our attention on the second level. In particular, we aimed to extract the distribution of explicit actions and their execution order in the communicative flow. We built an annotation scheme based on FrameNet and applied it to a dialogue corpus in the culinary domain. Preliminary results indicate that certain frames occur earlier in the dialogues, while others occur towards the end of the process. These findings shed light on the systematic nature of actions by establishing clear patterns and relationships between frames.
Pondera: A Personalized AI–Driven Weight Loss Mobile Companion with Multidimensional Goal Fulfillment Analytics
Georgi Pashev | Silvia Gaftandzhieva
The global obesity epidemic is a significant challenge to public health, necessitating innovative and personalized solutions. This paper presents Pondera, a mobile app that aims to improve weight management by integrating Artificial Intelligence (AI) and multidimensional goal fulfilment analytics. Pondera distinguishes itself by providing a tailored approach to weight loss, combining individual user data, including dietary preferences, fitness levels, and specific weight loss objectives, with advanced AI algorithms to generate personalized weight loss plans. Future development directions include refining the AI algorithms, enhancing the user experience, and validating effectiveness through comprehensive studies, ensuring Pondera becomes a pivotal tool for achieving sustainable weight loss and health improvement.
Mitigating Hallucinations in Large Language Models via Semantic Enrichment of Prompts: Insights from BioBERT and Ontological Integration
Stanislav Penkov
The advent of Large Language Models (LLMs) has been transformative for natural language processing, yet their tendency to produce “hallucinations”, outputs that are factually incorrect or entirely fabricated, remains a significant hurdle. This paper introduces a proactive methodology for reducing hallucinations by strategically enriching LLM prompts. This involves identifying key entities and contextual cues from varied domains and integrating this information into the LLM prompts to guide the model towards more accurate and relevant responses. Leveraging examples from BioBERT for biomedical entity recognition and the ChEBI chemical ontology, we illustrate a broader approach that encompasses semantic prompt enrichment as a versatile tool for enhancing LLM output accuracy. By examining the potential of semantic and ontological enrichment in diverse contexts, we aim to present a scalable strategy for improving the reliability of AI-generated content, thereby contributing to the ongoing efforts to refine LLMs for a wide range of applications.
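A hedged sketch of the prompt-enrichment idea: a small lexicon stands in for a real BioBERT NER model, and the glosses are simplified ChEBI-style entries:

```python
# Recognized entities are expanded with ontology glosses that get prepended
# to the prompt before it reaches the LLM.
GLOSSES = {  # hypothetical, simplified entries
    "caffeine": "a trimethylxanthine alkaloid (CHEBI:27732)",
    "ibuprofen": "a propionic acid derivative (CHEBI:5855)",
}

def enrich_prompt(question):
    found = [term for term in GLOSSES if term in question.lower()]
    if not found:
        return question
    context = "\n".join(f"- {t}: {GLOSSES[t]}" for t in found)
    return f"Known entities:\n{context}\n\nQuestion: {question}"

print(enrich_prompt("Does caffeine interact with ibuprofen?"))
```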
Commercially Minor Languages and Localization
Maria Todorova
This paper offers a perspective on languages with a less significant volume of digital usership as minor languages in the context of globalization and localization. With this premise, the risks this status poses to the quality of localized texts, the substantiality of genre conventions, the public image of professional translators, and users’ linguistic competence in these languages are explored. Furthermore, the common lack of established or clear conventions in the localization of digital products into commercially minor languages (and in digital product genres) is highlighted as one of the factors amplifying these risks. These perspectives are contextualized for Bulgarian with examples of errors encountered in Bulgarian digital content localized from English, more specifically errors and problems related to gender neutrality and register.
Semantic features in the automatic analysis of verbs of creation in Bulgarian and English
Ivelina Stoyanova
The paper focuses on the semantic class of verbs of creation as a subclass of dynamic verbs. The objective is to present the description of creation verbs in terms of their corresponding semantic frames and to outline the semantic features of the frame elements with a view to their automatic identification and analysis in text. The observations are performed on Bulgarian and English data with the aim of establishing the language-independent and language-specific features in the semantic description of the analysed class of verbs.
A ‘Dipdive’ into Motion: Exploring Lexical Resources towards a Comprehensive Semantic and Syntactic Description
Svetlozara Leseva
In this paper I illustrate the semantic description of verbs provided in three semantic resources (FrameNet, VerbNet and VerbAtlas) in comparative terms, with a view to identifying common and distinct components in their representations and obtaining a preliminary idea of the resources’ interoperability. To this end, I compare a small sample of motion verbs aligned with semantic frames and classes in the three resources. I also describe the semantic annotation of Bulgarian motion verbs using the framework defined in the Berkeley FrameNet project and its enrichment with information from the other two resources, enabled by the mapping between: (i) their major semantic units – FrameNet frames, VerbNet classes and VerbAtlas frames, and (ii) their ’building blocks’ – frame elements (FrameNet) and semantic roles (VerbNet, VerbAtlas).
Multilingual Corpus of Illustrative Examples on Activity Predicates
Ivelina Stoyanova | Hristina Kukova | Maria Todorova | Tsvetana Dimitrova
The paper presents the ongoing compilation of a multilingual corpus of illustrative examples to supplement our work on the syntactic and semantic analysis of predicates representing activities in Bulgarian and other languages. The corpus aims to include over 1,000 illustrative examples of verbs from six semantic classes of predicates (verbs of motion, contact, consumption, creation, competition and bodily functions), which provide a basis for observations on the specificity of their realisation. The corpus of illustrative examples will be used for contrastive studies and further elaboration of the scope and behaviour of activity verbs in general, as well as of their semantic subclasses.
Large Language Models in Linguistic Research: the Pilot and the Copilot
Svetla Koeva
In this paper, we present two experiments focusing on the linguistic classification and annotation of examples using zero-shot prompting. The aim is to show how large language models can confirm or reject the linguistic judgements of experts in order to increase the productivity of their work. In the first experiment, new lexical units evoking a particular FrameNet semantic frame are selected simultaneously with the annotation of examples with the core frame elements. The second experiment attempts to categorise verbs into aspectual classes, assuming that only certain combinations of verbs belonging to different aspectual classes evoke a semantic frame. The linguistic theories underlying the two experiments, the development of the prompts and the results of the experiments are presented.