Nikolay Mikhaylovskiy

2025

Zipf’s and Heaps’ Laws for Tokens and LLM-generated Texts
Nikolay Mikhaylovskiy
Findings of the Association for Computational Linguistics: EMNLP 2025

The frequency distribution of words in human-written texts roughly follows a simple mathematical form known as Zipf’s law. Somewhat less well known is the related Heaps’ law, which describes a sublinear power-law growth of vocabulary size with document size. We study the applicability of Zipf’s and Heaps’ laws to texts generated by Large Language Models (LLMs). We empirically show that Heaps’ and Zipf’s laws hold for LLM-generated texts only in a narrow, model-dependent temperature range. These temperatures have an optimal value close to t=1 for all the base models except the large Llama models, are higher for instruction-finetuned models, and do not depend on the model size or prompting. This independently confirms the recent discovery of sampling-temperature-dependent phase transitions in LLM-generated texts.

2024

Overview of Long Story Generation Challenge (LSGC) at INLG 2024
Aleksandr Migal | Daria Seredina | Ludmila Telnina | Nikita Nazarov | Anastasia Kolmogorova | Nikolay Mikhaylovskiy
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

This report describes the setup and results of the shared task of human-like long story generation, the LSG Challenge, which asks to generate a consistent, human-like long story (a Harry Potter fanfic in English for a general audience) given a prompt of about 1,000 tokens. We evaluated the submissions using both automated metrics and human evaluation protocols. The automated metrics, including the GAPELMAPER score, assessed the structuredness of the generated texts, while human annotators rated stories on dimensions such as relevance, consistency, fluency, and coherence. Additionally, annotators evaluated the models’ understanding of abstract concepts, causality, the logical order of events, and the avoidance of repeated plot elements. The results highlight the current strengths and limitations of state-of-the-art models in long-form story generation, with key challenges emerging in maintaining coherence over extended narratives and handling complex story dynamics. Our analysis provides insights into future directions for improving long story generation systems.

TSU HITS’s Submissions to the WMT 2024 General Machine Translation Shared Task
Vladimir Mynka | Nikolay Mikhaylovskiy
Proceedings of the Ninth Conference on Machine Translation

This paper describes the TSU HITS team’s submission system for the WMT’24 general translation task. We focused on exploring the capabilities of discrete diffusion models for the English-to-{Russian, German, Czech, Spanish} translation tasks in the constrained track. Our submission system consists of a set of discrete diffusion models for each language pair. The main advance is using a separate length regression model to determine the length of the output sequence more precisely.

2023

Team NTR @ AutoMin 2023: Dolly LLM Improves Minuting Performance, Semantic Segmentation Doesn’t
Eugene Borisov | Nikolay Mikhaylovskiy
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

This paper documents the approach of Team NTR for the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023. The goal of this work is to develop a module for automatic generation of meeting minutes based on a meeting transcript text produced by an Automated Speech Recognition (ASR) system (Task A). We consider minuting as a supervised machine learning task on pairs of texts: the transcript of the meeting and its minutes. We use a two-stage minuting pipeline that consists of segmentation and summarization. We experiment with semantic segmentation, multi-language approaches, and the Dolly Large Language Model, and achieve a Rouge1-F of 0.2455 and a BERT-Score of 0.8063 on the English part of the ELITR test set and a Rouge1-F of 0.2430 and a BERT-Score of 0.8332 on the EuroParl dev set with the submitted Naive Segmentation + Dolly7b pipeline.

Long Story Generation Challenge
Nikolay Mikhaylovskiy
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

We propose a shared task of human-like long story generation, the LSG Challenge, which asks models to output a consistent, human-like long story (a Harry Potter fanfic in English for a general audience) given a prompt of about 1K tokens. We suggest a novel statistical metric of text structuredness, the GloVe Autocorrelations Power/Exponential Law Mean Absolute Percentage Error Ratio (GAPELMAPER), along with the use of the previously known UNION metric and a human evaluation protocol. We hope that LSG can open new avenues for researchers to investigate sampling approaches, prompting strategies, and autoregressive and non-autoregressive text generation architectures, and break the barrier to generating consistent long (40K+ word) texts.

2021

Language ID Prediction from Speech Using Self-Attentive Pooling
Roman Bedyakin | Nikolay Mikhaylovskiy
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, creating a need for domain- and speaker-invariant language ID systems. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results for the language identification task.
