Viktoriia A. Chekalina


2026

Fine-tuning Transformer models is often dominated by the backward computation in linear layers. In many NLP tasks, input sequences are short and padded to a fixed context length, inducing structured sparsity in the output gradients. We propose Sparsity-Exploiting Backward Pass (SEBP), a heuristic method that reduces backward computation by exploiting this sparsity with negligible memory overhead. We show that, for short input sequences, the output gradients of BERT-based and LLaMA models exhibit pronounced sparsity, allowing for optimization of the backward computation. We optimize the autograd function in the linear layers, significantly reducing the number of FLOPs during the backward pass. Our method achieves a backward-pass speedup of approximately 2.15x for BERT-base on GLUE tasks and 1.99x for a 3B LLaMA model on reasoning benchmarks, while keeping memory usage nearly identical to regular PyTorch fine-tuning. Crucially, this speedup comes at no cost to performance. We show that our method matches standard convergence rates, offering a memory-efficient way to accelerate LLM fine-tuning.
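As a rough illustration of the idea, here is a minimal PyTorch sketch of a linear layer whose backward pass skips padded positions. The abstract does not expose the actual SEBP implementation, so the class name `SparseBackwardLinear`, the `pad_mask` argument, and the row-selection scheme are all assumptions made for illustration.

```python
import torch

class SparseBackwardLinear(torch.autograd.Function):
    # Hypothetical sketch of the SEBP idea: rows of grad_output that
    # correspond to padding tokens are skipped in the backward matmuls,
    # reducing FLOPs while keeping memory essentially unchanged.

    @staticmethod
    def forward(ctx, x, weight, pad_mask):
        # x: (tokens, in_features); pad_mask: (tokens,), True for real tokens
        ctx.save_for_backward(x, weight, pad_mask)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, pad_mask = ctx.saved_tensors
        idx = pad_mask.nonzero(as_tuple=True)[0]    # indices of real tokens
        g = grad_out.index_select(0, idx)           # (n_real, out_features)
        x_real = x.index_select(0, idx)             # (n_real, in_features)
        grad_weight = g.t() @ x_real                # (out, in): padded rows skipped
        grad_input = torch.zeros_like(x)
        grad_input.index_copy_(0, idx, g @ weight)  # scatter back to full shape
        return grad_input, grad_weight, None
```

The fewer real tokens a batch contains, the smaller both backward matmuls become, which is where a roughly 2x speedup on short, heavily padded sequences would come from.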
Modern LLM education is increasingly centered on system building: grounding generation with retrieval, enabling tool use, and deploying models under latency and cost constraints. We present an updated release of our open course on Transformer-based LLMs and multimodal models (Nikishina et al., 2024). The update introduces topics that have become important since the first edition: a session on Retrieval-Augmented Generation (RAG), a hands-on session on tool-using agents, an API-based track for applied work with LLMs, and practical local inference with vLLM. We also add a dedicated session on multimodal dialog models with a focus on dialog grounding, and we enrich the course with a discussion of long-context Transformers, focusing on KV-cache efficiency along with related models and benchmarks. All materials are released online.
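For readers curious about the local-inference session, a minimal vLLM usage sketch looks like the following; the model name and sampling settings are illustrative choices, not the course's exact configuration.

```python
from vllm import LLM, SamplingParams

# Load a locally available or Hugging Face hosted model (name is illustrative).
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain the role of the KV cache in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```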

2024

The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically operate on Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer's elements remain significant. By converting gradients to a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa on NLU tasks and LLaMa-2 on a question-answering task. In these experiments, under identical memory requirements, our method outperforms LoRA and MeProp, two robust and popular state-of-the-art PEFT approaches.
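Since the abstract does not spell out SparseGrad's exact gradient transform, here is a simplified stand-in that conveys the selective-update idea: a PyTorch gradient hook keeping only the top ~1% of entries by magnitude. This is closer in spirit to MeProp-style top-k than to the paper's method; the helper name and the module path in the usage comment are hypothetical.

```python
import torch

def sparse_grad_hook(keep_ratio: float = 0.01):
    # Returns a hook that zeroes all but the largest-magnitude gradient
    # entries, so only ~keep_ratio of the layer's weights are updated.
    def hook(grad: torch.Tensor) -> torch.Tensor:
        k = max(1, int(keep_ratio * grad.numel()))
        flat = grad.abs().flatten()
        # the k-th largest value is the (numel - k + 1)-th smallest
        threshold = flat.kthvalue(flat.numel() - k + 1).values
        return grad * (grad.abs() >= threshold)
    return hook

# Usage on an MLP weight (module path is hypothetical):
# dense = model.encoder.layer[0].intermediate.dense
# dense.weight.register_hook(sparse_grad_hook(0.01))
```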