Denis Dimitrov


2024

pdf bib
The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models
Anton Razzhigaev | Matvey Mikhalchuk | Elizaveta Goncharova | Ivan Oseledets | Denis Dimitrov | Andrey Kuznetsov
Findings of the Association for Computational Linguistics: EACL 2024

In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. This fact is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.

2023

pdf bib
Kandinsky: An Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion
Anton Razzhigaev | Arseniy Shakhmatov | Anastasia Maltseva | Vladimir Arkhipkin | Igor Pavlov | Ilya Ryabov | Angelina Kuts | Alexander Panchenko | Andrey Kuznetsov | Denis Dimitrov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Text-to-image generation is a significant domain in modern computer vision and achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models demonstrated essential quality enhancements. These models generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky – a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model, is trained separately to map CLIP text and image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation and text-guided inpainting/outpainting. Additionally we released the source code and checkpoints for Kandinsky models. Experimental evaluations demonstrate FID score of 8.03 on the COCO-30K dataset, marking our model as the top open source performer in terms of measurable image generation quality.

2022

pdf bib
Pixel-Level BPE for Auto-Regressive Image Generation
Anton Razzhigaev | Anton Voronov | Andrey Kaznacheev | Andrey Kuznetsov | Denis Dimitrov | Alexander Panchenko
Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models

Pixel-level autoregression with Transformer models (Image GPT or iGPT) is one of the recent approaches to image generation that has not received massive attention and elaboration due to quadratic complexity of attention as it imposes huge memory requirements and thus restricts the resolution of the generated images. In this paper, we propose to tackle this problem by adopting Byte-Pair-Encoding (BPE) originally proposed for text processing to the image domain to drastically reduce the length of the modeled sequence. The obtained results demonstrate that it is possible to decrease the amount of computation required to generate images pixel-by-pixel while preserving their quality and the expressiveness of the features extracted from the model. Our results show that there is room for improvement for iGPT-like models with more thorough research on the way to the optimal sequence encoding techniques for images.