Motivating Next-Gen Accelerators with Flexible N:M Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches

Shirin Alanova; Kristina Kazistova; Ekaterina Galaeva; Alina Kostromina; Vladimir Smirnov; Redko Dmitry; Alexey Dontsov; Maxim Zhelnin; Evgeny Burnaev; Egor Shvetsov

Abstract

The demand for efficient large language model inference has spurred interest in sparsification, yet current hardware support remains narrowly focused on 2:4 weight sparsity. In this work, we argue that activation sparsity, despite being overlooked in hardware design, offers a promising path for dynamic, input-adaptive compression with significant I/O and memory benefits. We present a comprehensive post-training study of N:M activation pruning across four LLMs (Llama2-7B-chat, Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma3-4B-Instruct), demonstrating that activation pruning consistently outperforms weight pruning at matched sparsity levels. We evaluate lightweight, plug-and-play error mitigation and selection strategies that require minimal or no calibration data across four sparsity patterns: 2:4, 4:8, 8:16, and 16:32. Among these, 16:32 approaches the performance of unstructured 50% sparsity and is approximately 2.7× better than 2:4, while 8:16 offers an optimal balance of accuracy and practicality. Our results provide evidence that next-generation accelerators should consider native support for N:M activation sparsity, and they can serve as a strong baseline for future methods.
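To make the N:M patterns named in the abstract concrete, the following is a minimal sketch of N:M pruning applied to an activation tensor: in every contiguous group of M values along the last axis, N are kept and the rest are zeroed. The function name `nm_prune` and the choice of plain magnitude as the selection criterion are assumptions for illustration only; the abstract does not specify which selection or error-mitigation strategies the paper evaluates.

```python
import numpy as np


def nm_prune(x: np.ndarray, n: int, m: int) -> np.ndarray:
    """Keep the n largest-magnitude values in each contiguous group of m
    along the last axis and zero the rest (magnitude selection is an
    illustrative assumption, not necessarily the paper's criterion)."""
    if x.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = x.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(x.shape)


# Hypothetical activation row with 8 features.
x = np.array([[0.1, -2.0, 0.5, 3.0, -0.2, 0.05, 0.3, -0.7]])

y_2_4 = nm_prune(x, 2, 4)  # two groups of 4, keep 2 in each
y_4_8 = nm_prune(x, 4, 8)  # one group of 8, keep 4
```

Both calls remove 50% of the values, but 2:4 must keep 0.3 in the second group and drop 0.5 in the first, whereas 4:8 is free to keep 0.5 instead. This extra freedom in larger groups gives one intuition for why the abstract reports 16:32 approaching unstructured 50% sparsity.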

Anthology ID: 2026.acl-industry.17
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 242–258
URL: https://aclanthology.org/2026.acl-industry.17/
Cite (ACL): Shirin Alanova, Kristina Kazistova, Ekaterina Galaeva, Alina Kostromina, Vladimir Smirnov, Redko Dmitry, Alexey Dontsov, Maxim Zhelnin, Evgeny Burnaev, and Egor Shvetsov. 2026. Motivating Next-Gen Accelerators with Flexible N:M Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 242–258, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Motivating Next-Gen Accelerators with Flexible N:M Activation Sparsity via Benchmarking Lightweight Post-Training Sparsification Approaches (Alanova et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-industry.17.pdf
