2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov | Catherine Arnett | Elizaveta Korotkova | Ivan P. Yamshchikov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language models can greatly benefit from efficient tokenization. However, they still mostly utilize the classical Byte-Pair Encoding (BPE) algorithm, a simple and reliable method. BPE has been shown to cause such issues as under-trained tokens and sub-optimal compression that may affect the downstream performance. We introduce PickyBPE, a modified BPE algorithm that carries out vocabulary refinement during tokenizer training by removing merges that leave intermediate “junk” tokens. Our method improves vocabulary efficiency, eliminates under-trained tokens, and does not compromise text compression. Our experiments show that this method either improves downstream performance or does not harm it.
Knowledge Graph Representation for Political Information Sources
Tinatin Osmonova | Alexey Tikhonov | Ivan P. Yamshchikov
Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024
With the rise of computational social science, many scholars utilize data analysis and natural language processing tools to analyze social media, news articles, and other accessible data sources for examining political and social discourse. In particular, the emergence of echo-chambers due to the dissemination of specific information has become a topic of interest in mixed methods research areas. In this paper, we analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT), to prove the hypothesis that the formation of echo-chambers can be partially explained on the level of individual information consumption rather than the collective topology of individuals’ social networks. Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals. We demonstrate that applying knowledge representation techniques to these news streams reveals, contrary to common assumptions, the relative “internal” neutrality of both sources and a polarizing attitude towards a small fraction of entities. Additionally, we argue that such characteristics in information sources lead to fundamental disparities in audience worldviews, potentially acting as a catalyst for the formation of echo-chambers.
Echo-chambers and Idea Labs: Communication Styles on Twitter
Aleksandra Sorokovikova | Michael Becker | Ivan P. Yamshchikov
Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024
This paper investigates the communication styles and structures of Twitter (X) communities within the vaccination context. While mainstream research primarily focuses on the echo-chamber phenomenon, wherein certain ideas are reinforced and participants are isolated from opposing opinions, this study reveals the presence of diverse communication styles across various communities. In addition to the communities exhibiting echo-chamber behavior, this research uncovers communities with distinct communication patterns. By shedding light on the nuanced nature of communication within social networks, this study emphasizes the significance of understanding the diversity of perspectives within online communities.
Individuation in Neural Models with and without Visual Grounding
Alexey Tikhonov | Lisa Bylinina | Ivan P. Yamshchikov
Proceedings of the 1st Workshop on NLP for Science (NLP4Science)
We show differences between a language-and-vision model CLIP and two text-only models — FastText and SBERT — when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin
Svetlana Gorovaia | Gleb Schmidt | Ivan P. Yamshchikov
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
This paper evaluates the performance of Large Language Models (LLMs) in authorship attribution and authorship verification tasks for Latin texts of the Patristic Era. The study shows that LLMs can be robust in zero-shot authorship verification even on short texts without sophisticated feature engineering. Yet, the models can also be easily “misled” by semantics. The experiments also demonstrate that steering the model’s authorship analysis and decision-making is challenging, unlike what is reported in the studies dealing with high-resource modern languages. Although LLMs prove to be able to beat, under certain circumstances, the traditional baselines, obtaining a nuanced and truly explainable decision requires at best a lot of experimentation.
Neural Machine Translation for Malayalam Paraphrase Generation
Christeena Varghese | Sergey Koshelev | Ivan P. Yamshchikov
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
This study explores four methods of generating paraphrases in Malayalam, utilizing resources available for English paraphrasing and pre-trained Neural Machine Translation (NMT) models. We evaluate the resulting paraphrases using both automated metrics, such as BLEU, METEOR, and cosine similarity, and human annotation. Our findings suggest that automated evaluation measures may not be fully appropriate for Malayalam, as they do not consistently align with human judgment. This discrepancy underscores the need for more nuanced paraphrase evaluation approaches, especially for highly agglutinative languages.
LLMs Simulate Big5 Personality Traits: Further Evidence
Aleksandra Sorokovikova | Sharwin Rezagholi | Natalia Fedorova | Ivan P. Yamshchikov
Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024)
An empirical investigation into the simulation of the Big5 personality traits by large language models (LLMs), namely Llama-2, GPT-4, and Mixtral, is presented. We analyze the personality traits simulated by these models and their stability. This contributes to the broader understanding of the capabilities of LLMs to simulate personality traits and the respective implications for personalized human-computer interaction.
Vygotsky Distance: Measure for Benchmark Task Similarity
Maxim K. Surkov | Ivan P. Yamshchikov
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Evaluation plays a significant role in modern natural language processing. Most modern NLP benchmarks consist of arbitrary sets of tasks that neither guarantee any generalization potential for the model once applied outside the test set nor try to minimize the resource consumption needed for model evaluation. This paper presents a theoretical instrument and a practical algorithm to calculate similarity between benchmark tasks; we call this similarity measure “Vygotsky distance”. The core idea of this similarity measure is that it is based on the relative performance of the “students” on a given task, rather than on the properties of the task itself. If two tasks are close to each other in terms of Vygotsky distance, the models tend to have similar relative performance on them. Thus, knowing the Vygotsky distance between tasks, one can significantly reduce the number of evaluation tasks while maintaining a high validation quality. Experiments on various benchmarks, including GLUE, SuperGLUE, CLUE, and RussianSuperGLUE, demonstrate that a vast majority of NLP benchmarks could be at least 40% smaller in terms of the tasks included. Most importantly, Vygotsky distance could also be used for the validation of new tasks, thus increasing the generalization potential of future NLP models.
2023
Post Turing: Mapping the landscape of LLM Evaluation
Alexey Tikhonov | Ivan P. Yamshchikov
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
In the rapidly evolving landscape of Large Language Models (LLMs), introduction of well-defined and standardized evaluation methodologies remains a crucial challenge. This paper traces the historical trajectory of LLM evaluations, from the foundational questions posed by Alan Turing to the modern era of AI research. We categorize the evolution of LLMs into distinct periods, each characterized by its unique benchmarks and evaluation criteria. As LLMs increasingly mimic human-like behaviors, traditional evaluation proxies, such as the Turing test, have become less reliable. We emphasize the pressing need for a unified evaluation system, given the broader societal implications of these models. Through an analysis of common evaluation methodologies, we advocate for a qualitative shift in assessment approaches, underscoring the importance of standardization and objective criteria. This work serves as a call for the AI community to collaboratively address the challenges of LLM evaluation, ensuring their reliability, fairness, and societal benefit.
What is Wrong with Language Models that Can Not Tell a Story?
Ivan P. Yamshchikov | Alexey Tikhonov
Proceedings of the 5th Workshop on Narrative Understanding
In this position paper, we contend that advancing our understanding of narrative and the effective generation of longer, subjectively engaging texts is crucial for progress in modern Natural Language Processing (NLP) and potentially the broader field of Artificial Intelligence. We highlight the current lack of appropriate datasets, evaluation methods, and operational concepts necessary for initiating work on narrative processing.
2022
BERT in Plutarch’s Shadows
Ivan P. Yamshchikov | Alexey Tikhonov | Yorgos Pantis | Charlotte Schubert | Jürgen Jost
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
The extensive surviving corpus of the ancient scholar Plutarch of Chaeronea (ca. 45-120 CE) also contains several texts which, according to current scholarly opinion, did not originate with him and are therefore attributed to an anonymous author Pseudo-Plutarch. These include, in particular, the work Placita Philosophorum (Quotations and Opinions of the Ancient Philosophers), which is extremely important for the history of ancient philosophy. Little is known about the identity of that anonymous author and its relation to other authors from the same period. This paper presents a BERT language model for Ancient Greek. The model discovers previously unknown statistical properties relevant to these literary, philosophical, and historical problems and can shed new light on this authorship question. In particular, the Placita Philosophorum, together with one of the other Pseudo-Plutarch texts, shows similarities with the texts written by authors from an Alexandrian context (2nd/3rd century CE).
Do Data-based Curricula Work?
Maxim Surkov | Vladislav Mosin | Ivan P. Yamshchikov
Proceedings of the Third Workshop on Insights from Negative Results in NLP
Current state-of-the-art NLP systems use large neural networks that require extensive computational resources for training. Inspired by human knowledge acquisition, researchers have proposed curriculum learning: sequencing tasks (task-based curricula) or ordering and sampling the datasets (data-based curricula) in ways that facilitate training. This work investigates the benefits of data-based curriculum learning for large language models such as BERT and T5. We experiment with various curricula based on complexity measures and different sampling strategies. Extensive experiments on several NLP tasks show that curricula based on various complexity measures rarely have any benefits, while random sampling performs as well as or better than curricula.
2021
StoryDB: Broad Multi-language Narrative Dataset
Alexey Tikhonov | Igor Samenko | Ivan P. Yamshchikov
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
This paper presents StoryDB, a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20,000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages, including low-resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistilBERT, mBERT, and XLM-RoBERTa.
2019
Style Transfer for Texts: Retrain, Report Errors, Compare with Rewrites
Alexey Tikhonov | Viacheslav Shibaev | Aleksander Nagaev | Aigul Nugmanova | Ivan P. Yamshchikov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
This paper shows that the standard assessment methodology for style transfer has several significant problems. First, the standard metrics for style accuracy and semantics preservation vary significantly on different re-runs. Therefore, one has to report error margins for the obtained results. Second, starting with certain values of bilingual evaluation understudy (BLEU) between input and output and accuracy of the sentiment transfer, the optimization of these two standard metrics diverges from the intuitive goal of the style transfer task. Finally, due to the nature of the task itself, there is a specific dependence between these two metrics that could be easily manipulated. Under these circumstances, we suggest taking BLEU between input and human-written reformulations into consideration for benchmarks. We also propose three new architectures that outperform the state of the art in terms of this metric.
Decomposing Textual Information For Style Transfer
Ivan P. Yamshchikov | Viacheslav Shibaev | Aleksander Nagaev | Jürgen Jost | Alexey Tikhonov
Proceedings of the 3rd Workshop on Neural Generation and Translation
This paper focuses on latent representations that could effectively decompose different aspects of textual information. Using a framework of style transfer for texts, we propose several empirical methods to assess information decomposition quality. We validate these methods with several state-of-the-art textual style transfer methods. Higher quality of information decomposition corresponds to higher performance in terms of bilingual evaluation understudy (BLEU) between output and human-written reformulations.
Dyr Bul Shchyl. Proxying Sound Symbolism With Word Embeddings
Ivan P. Yamshchikov | Viascheslav Shibaev | Alexey Tikhonov
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP
This paper explores modern word embeddings in the context of sound symbolism. Using basic properties of the representation space, one can construct semantic axes. A method is proposed to measure whether the presence of individual sounds in a given word shifts the semantics of that word along a specific axis. It is shown that, in accordance with several experimental and statistical results, word embeddings capture symbolism for certain sounds.
2018
Sounds Wilde. Phonetically Extended Embeddings for Author-Stylized Poetry Generation
Aleksey Tikhonov | Ivan P. Yamshchikov
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
This paper addresses author-stylized text generation. Using a version of a language model with extended phonetic and semantic embeddings for poetry generation, we show that phonetics contributes to the overall model performance comparably to the information on the target author. Phonetic information is shown to be important for both English and Russian. Humans tend to attribute machine-generated texts to the target author.