Acceleration of Backpropagation in Linear Layers of Transformer Models Based on Gradient Structure

Dmitrii Topchii; Alexander Panchenko; Viktoriia A. Chekalina

Abstract

Fine-tuning Transformer models is often dominated by the backward computation in linear layers. In many NLP tasks, input sequences are short and padded to a fixed context length, inducing structured sparsity in the output gradients. We propose Sparsity-Exploiting Backward Pass (SEBP), a heuristic method that reduces backward computation by exploiting this sparsity with negligible memory overhead. We show that, for short input sequences, the output gradients of BERT-based and LLaMA models exhibit pronounced sparsity, which allows the backward computation to be optimized. We optimized the autograd function in the linear layers, significantly reducing the number of FLOPs during the backward pass. Our method achieves a backward pass speedup of approximately 2.15x for BERT-base on GLUE tasks and 1.99x for a 3B LLaMA model on reasoning benchmarks, while keeping memory usage nearly identical to regular PyTorch fine-tuning. Crucially, this speedup comes at no cost to performance: we show that our method matches standard convergence rates, offering a memory-efficient way to accelerate LLM fine-tuning.
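The abstract does not spell out how SEBP is implemented, so the following is only a minimal sketch of the underlying idea, not the authors' method. It assumes that the output-gradient rows belonging to padded positions are exactly zero and that a linear layer computes y = x · wᵀ. Under those assumptions, dropping the zero rows before the two backward matrix products gives the same gradients while multiplying fewer rows. The function names `dense_backward` and `sparse_row_backward` are hypothetical, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Dense product of two row-major list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a row-major list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dense_backward(x, w, grad_out):
    """Reference backward for y = x @ w.T; returns (grad_x, grad_w)."""
    grad_x = matmul(grad_out, w)              # shape (N, d_in)
    grad_w = matmul(transpose(grad_out), x)   # shape (d_out, d_in)
    return grad_x, grad_w


def sparse_row_backward(x, w, grad_out):
    """Same gradients, multiplying only rows of grad_out that are non-zero.

    Assumption (ours, for illustration): rows of grad_out at padded
    positions are exactly zero, so they contribute nothing to either product.
    """
    keep = [i for i, row in enumerate(grad_out) if any(v != 0.0 for v in row)]
    d_in = len(w[0])
    grad_x = [[0.0] * d_in for _ in grad_out]  # skipped rows stay zero
    grad_w = [[0.0] * d_in for _ in w]
    if keep:
        g_kept = [grad_out[i] for i in keep]
        x_kept = [x[i] for i in keep]
        for i, row in zip(keep, matmul(g_kept, w)):
            grad_x[i] = row
        grad_w = matmul(transpose(g_kept), x_kept)
    return grad_x, grad_w, len(keep)


# Toy batch: 4 token positions, the last 2 are padding (zero output gradient).
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]
w = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
grad_out = [[0.1, -0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]

ref_grad_x, ref_grad_w = dense_backward(x, w, grad_out)
fast_grad_x, fast_grad_w, rows_used = sparse_row_backward(x, w, grad_out)
```

In this toy case only 2 of the 4 rows enter the matrix products, so the multiply-add count of both products is halved. More generally, the saving in such a scheme scales with the fraction of padded positions in the batch.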

Anthology ID: 2026.eacl-srw.31
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 426–436
URL: https://aclanthology.org/2026.eacl-srw.31/
Cite (ACL): Dmitrii Topchii, Alexander Panchenko, and Viktoriia A. Chekalina. 2026. Acceleration of Backpropagation in Linear Layers of Transformer Models Based on Gradient Structure. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 426–436, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Acceleration of Backpropagation in Linear Layers of Transformer Models Based on Gradient Structure (Topchii et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-srw.31.pdf
