Nikolai Gerasimenko
2024
Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework
Vladimir Arkhipkin | Viacheslav Vasilev | Andrei Filatov | Igor Pavlov | Julia Agafonova | Nikolai Gerasimenko | Anna Averchenkova | Evelina Mironova | Anton Bukashkin | Konstantin Kulikov | Andrey Kuznetsov | Denis Dimitrov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Text-to-image (T2I) diffusion models are popular for introducing image manipulation methods, such as editing, image fusion, inpainting, etc. At the same time, image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion that achieves a high level of quality and photorealism. The key feature of the new architecture is the simplicity and efficiency of its adaptation to many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, and I2V and T2V generation. We also present a distilled version of the T2I model that runs inference in 4 steps of the reverse process, 3 times faster than the base model, without reducing image quality. We deployed a user-friendly demo system in which all the features can be tested in the public domain. Additionally, we released the source code and checkpoints for Kandinsky 3 and the extended models. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open-source generation systems.
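To make the workflow concrete, the sketch below shows how a latent-diffusion T2I model of this kind is typically invoked for plain text-to-image generation through the Hugging Face diffusers AutoPipeline interface. The model id kandinsky-community/kandinsky-3 and the sampling settings are assumptions for illustration, not part of the abstract; the paper's demo system and distilled 4-step checkpoint are separate artifacts.

```python
# Minimal text-to-image sketch with a latent-diffusion pipeline, assumed to be
# published on the Hugging Face Hub under "kandinsky-community/kandinsky-3".
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-3",  # assumed model id
    variant="fp16",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # keeps GPU memory usage modest

prompt = "A photorealistic portrait of a red fox in a snowy forest, golden hour"
image = pipe(
    prompt,
    num_inference_steps=25,  # a distilled model would use ~4 reverse-process steps
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("fox.png")
```

Per the abstract, switching to the distilled checkpoint would mainly amount to lowering the number of reverse-process steps to 4 while keeping the rest of the call unchanged.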
Your Transformer is Secretly Linear
Anton Razzhigaev | Matvey Mikhalchuk | Elizaveta Goncharova | Nikolai Gerasimenko | Ivan Oseledets | Denis Dimitrov | Andrey Kuznetsov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, and BLOOM. We analyze embedding transformations between sequential layers, uncovering an almost perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low output norm of the transformer layers. Our experiments show that pruning or linearly approximating some of the layers does not significantly impact loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization not only improves performance metrics on benchmarks such as Tiny Stories and SuperGLUE but also successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.
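The core measurement is easy to reproduce in spirit: take the embeddings produced by two consecutive layers, fit the best linear map between them, and check how much of the next layer's variance that map explains. The sketch below uses synthetic data and a plain least-squares fit; the paper's exact normalized Procrustes similarity may differ in details, and the function name here is ours.

```python
import numpy as np

def linearity_score(X: np.ndarray, Y: np.ndarray) -> float:
    """Fit Y ≈ X @ A by least squares and return the fraction of variance of Y
    explained by the fitted linear map (1.0 means a perfectly linear relation)."""
    # Center both sets of embeddings before measuring linearity.
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    A, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)  # best linear map layer l -> l+1
    residual = np.linalg.norm(Xc @ A - Yc) ** 2
    total = np.linalg.norm(Yc) ** 2
    return 1.0 - residual / total

# Toy demonstration: Y is an exact linear function of X plus small noise,
# so the score lands close to 1, mirroring the near-perfect linearity the
# paper reports between consecutive layers when the residual stream is kept.
rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 64))                   # "layer l" embeddings
W = rng.normal(size=(64, 64))
Y = X @ W + 0.01 * rng.normal(size=(2048, 64))    # "layer l+1" embeddings
print(f"linearity score: {linearity_score(X, Y):.4f}")
```

On a real decoder, the two matrices would instead be hidden states collected from consecutive layers (e.g. via output_hidden_states=True in the transformers library); per the abstract, the score stays near 0.99 with the residual stream included and drops once the residual component is subtracted.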