Svetla Peneva Koeva


2025

IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva | Ivelina Stoyanova | Jordan Konstantinov Kralev
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

The paper presents IfGPT, a large dataset that brings together the available corpora and datasets for Bulgarian, and describes methods for continuously expanding it with deduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning large language models (LLMs) or for Retrieval-Augmented Generation (RAG). The paper focuses on the extended metadata of the IfGPT dataset and its management in a graph database.

2023

XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs
Federico Martelli | Andrei Stefan Bejgu | Cesare Campagnano | Jaka Čibej | Rute Costa | Apolonija Gantar | Jelena Kallas | Svetla Peneva Koeva | Kristina Koppel | Simon Krek | Margit Langemets | Veronika Lipp | Sanni Nimb | Sussi Olsen | Bolette Sanford Pedersen | Valeria Quochi | Ana Salgado | László Simon | Carole Tiberius | Rafael-J Ureña-Ruiz | Roberto Navigli
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)