Dmitrii Topchii


2026

Fine-tuning Transformer models is often dominated by the backward computation in linear layers. In many NLP tasks, input sequences are short and padded to a fixed context length, inducing structured sparsity in the output gradients. We propose the Sparsity-Exploiting Backward Pass (SEBP), a heuristic method that reduces backward computation by exploiting this sparsity with negligible memory overhead. We show that, for short input sequences, the output gradients of BERT-based and LLaMA models exhibit pronounced sparsity, which can be exploited to reduce the backward computation. We optimize the autograd function of the linear layers, significantly reducing the number of FLOPs in the backward pass. Our method achieves a backward-pass speedup of approximately 2.15x for BERT-base on GLUE tasks and 1.99x for a 3B LLaMA model on reasoning benchmarks, while keeping memory usage nearly identical to regular PyTorch fine-tuning. Crucially, this speedup comes at no cost to performance: our method matches standard convergence rates, offering a memory-efficient way to accelerate LLM fine-tuning.
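
To illustrate the idea described above, the following is a minimal sketch (not the authors' released implementation) of a padding-aware linear layer in PyTorch. It assumes that the output gradient contains entire zero rows at padded token positions, and restricts the weight- and input-gradient GEMMs to the non-zero rows; the class name SparseBackwardLinear and the zero-row mask are illustrative assumptions.

```python
import torch


class SparseBackwardLinear(torch.autograd.Function):
    """Linear op whose backward skips all-zero rows of the output gradient."""

    @staticmethod
    def forward(ctx, x, weight, bias):
        # x: (tokens, in_features), weight: (out_features, in_features)
        ctx.save_for_backward(x, weight)
        ctx.has_bias = bias is not None
        out = x @ weight.t()
        if bias is not None:
            out = out + bias
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Rows of grad_out that are entirely zero (e.g. padded positions)
        # contribute nothing to any gradient, so drop them before the GEMMs.
        nonzero = grad_out.abs().sum(dim=-1) != 0
        g = grad_out[nonzero]

        grad_w = g.t() @ x[nonzero]          # reduced GEMM over non-padded rows
        grad_x = torch.zeros_like(x)
        grad_x[nonzero] = g @ weight         # only non-padded rows are filled
        grad_b = g.sum(dim=0) if ctx.has_bias else None
        return grad_x, grad_w, grad_b


# Toy usage: only the first 16 of 128 token positions receive a non-zero
# output gradient, so the backward GEMMs run over 16 rows instead of 128.
x = torch.randn(128, 768, requires_grad=True)
w = torch.randn(3072, 768, requires_grad=True)
b = torch.randn(3072, requires_grad=True)
y = SparseBackwardLinear.apply(x, w, b)
y[:16].sum().backward()
```

The memory overhead of this scheme is limited to the per-row mask, which is consistent with the near-identical memory footprint reported above.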