Svetla Peneva Koeva

2025

IfGPT: A Dataset in Bulgarian for Large Language Models
Svetla Peneva Koeva | Ivelina Stoyanova | Jordan Konstantinov Kralev
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages

The paper presents IfGPT, a large dataset that contains the available corpora and datasets for Bulgarian, and describes methods for continuously expanding it with deduplicated and unbiased Bulgarian data. The samples in the dataset are annotated with metadata that enable effective extraction of domain- and application-oriented datasets for fine-tuning large language models (LLMs) or for Retrieval-Augmented Generation (RAG). The paper focuses on the extended metadata of the IfGPT dataset and its management in a graph database.

