Ivan Yamshchikov


2024

Neural Machine Translation for Malayalam Paraphrase Generation
Christeena Varghese | Sergey Koshelev | Ivan Yamshchikov
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, and human annotation. Our findings suggest that automated evaluation measures may not be fully appropriate for Malayalam, as they do not consistently align with human judgment. This discrepancy underscores the need for more nuanced paraphrase evaluation approaches, especially for highly agglutinative languages.

LLMs Simulate Big5 Personality Traits: Further Evidence
Aleksandra Sorokovikova | Sharwin Rezagholi | Natalia Fedorova | Ivan Yamshchikov
Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024)

An empirical investigation into the simulation of the Big5 personality traits by large language models (LLMs), namely Llama-2, GPT-4, and Mixtral, is presented. We analyze the personality traits simulated by these models and their stability. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits and the respective implications for personalized human-computer interaction.

2023

What is Wrong with Language Models that Can Not Tell a Story?
Ivan Yamshchikov | Alexey Tikhonov
Proceedings of the 5th Workshop on Narrative Understanding

In this position paper, we contend that advancing our understanding of narrative and the effective generation of longer, subjectively engaging texts is crucial for progress in modern Natural Language Processing (NLP) and potentially the broader field of Artificial Intelligence. We highlight the current lack of appropriate datasets, evaluation methods, and operational concepts necessary for initiating work on narrative processing.

2022

Do Data-based Curricula Work?
Maxim Surkov | Vladislav Mosin | Ivan Yamshchikov
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Current state-of-the-art NLP systems use large neural networks that require extensive computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning: sequencing tasks (task-based curricula) or ordering and sampling the datasets (data-based curricula) in ways that facilitate training. This work investigates the benefits of data-based curriculum learning for large language models such as BERT and T5. We experiment with various curricula based on complexity measures and different sampling strategies. Extensive experiments on several NLP tasks show that curricula based on various complexity measures rarely have any benefits, while random sampling performs as well as or better than curricula.

BERT in Plutarch’s Shadows
Ivan Yamshchikov | Alexey Tikhonov | Yorgos Pantis | Charlotte Schubert | Jürgen Jost
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The extensive surviving corpus of the ancient scholar Plutarch of Chaeronea (ca. 45-120 CE) also contains several texts which, according to current scholarly opinion, did not originate with him and are therefore attributed to an anonymous author, Pseudo-Plutarch. These include, in particular, the work Placita Philosophorum (Quotations and Opinions of the Ancient Philosophers), which is extremely important for the history of ancient philosophy. Little is known about the identity of that anonymous author and their relation to other authors from the same period. This paper presents a BERT language model for Ancient Greek. The model discovers previously unknown statistical properties relevant to these literary, philosophical, and historical problems and can shed new light on this authorship question. In particular, the Placita Philosophorum, together with one of the other Pseudo-Plutarch texts, shows similarities with the texts written by authors from an Alexandrian context (2nd/3rd century CE).

2021

StoryDB: Broad Multi-language Narrative Dataset
Alexey Tikhonov | Igor Samenko | Ivan Yamshchikov
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

This paper presents StoryDB, a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20,000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages, including low-resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.