Svitlana Galeshchuk
2024
Entity Embellishment Mitigation in LLMs Output with Noisy Synthetic Dataset for Alignment
Svitlana Galeshchuk
Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024
The present work focuses on the entity embellishments when named entities are accompanied by additional information that is not supported by the context or the source material. Our paper contributes into mitigating this problem in large language model’s generated texts, summaries in particular, by proposing the approach with synthetic noise injection in the generated samples that are further used for alignment of finetuned LLM. We also challenge the issue of solutions scarcity for low-resourced languages and test our approach with corpora in Ukrainian.
2023
Abstractive Summarization for the Ukrainian Language: Multi-Task Learning with Hromadske.ua News Dataset
Svitlana Galeshchuk
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
Despite recent NLP developments, abstractive summarization remains a challenging task, especially in the case of low-resource languages like Ukrainian. The paper aims at improving the quality of summaries produced by mT5 for news in Ukrainian by fine-tuning the model with a mixture of summarization and text similarity tasks using summary-article and title-article training pairs, respectively. The proposed training set-up with small, base, and large mT5 models produce higher quality résumé. Besides, we present a new Ukrainian dataset for the abstractive summarization task that consists of circa 36.5K articles collected from Hromadske.ua until June 2021.
2019
Sentiment Analysis for Multilingual Corpora
Svitlana Galeshchuk
|
Ju Qiu
|
Julien Jourdan
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
The paper presents a generic approach to the supervised sentiment analysis of social media content in Slavic languages. The method proposes translating the documents from the original language to English with Google’s Neural Translation Model. The resulted texts are then converted to vectors by averaging the vectorial representation of words derived from a pre-trained Word2Vec English model. Testing the approach with several machine learning methods on Polish, Slovenian and Croatian Twitter datasets returns up to 86% of classification accuracy on the out-of-sample data.