GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

Ziyang Wang; Jiangfeng Xiao; Chuan Xiao; Ruoxiang LI; Rui Mao; Jianbin Qin

Abstract

Large language models (LLMs) are expensive to serve because dense FFN blocks, multi-head attention, and KV caches dominate memory. Structured pruning is therefore a natural way to reduce serving costs under tight parameter and memory budgets. We present GRASPrune, a global budgeted structured pruning framework applied post-hoc to a pretrained model that jointly prunes FFN channels and attention KV head groups under a single global parameter budget. GRASPrune attaches lightweight learnable gates to prunable units and optimizes only these gates on a small unlabeled language-modeling calibration set, keeping all backbone weights frozen while enforcing the target sparsity at every step. A final budget-preserving scaling calibration reweights the surviving channels and heads to correct scale shifts introduced by pruning. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five downstream benchmarks. This requires only a short calibration run of four epochs on 512 unlabeled sequences on a single NVIDIA A100 80GB GPU, without any full-model fine-tuning.
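
As a rough illustration of what "a single global parameter budget" over heterogeneous units can mean, the sketch below selects which units to keep across FFN channels and KV head groups so that the kept parameter count never exceeds the budget. It is not the authors' algorithm: the abstract does not specify how gates are projected onto the budget, so the greedy rule, the `Unit` record, the `project_to_budget` helper, and all unit names, costs, and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str     # hypothetical identifier for a prunable unit
    cost: int     # parameters this unit contributes if kept
    score: float  # current gate value (higher = more important)


def project_to_budget(units, keep_budget):
    """Pick units to keep so that total kept cost <= keep_budget.

    Assumed rule (not from the paper): visit units in order of
    decreasing gate score and keep each one that still fits.
    """
    kept, used = set(), 0
    for u in sorted(units, key=lambda u: u.score, reverse=True):
        if used + u.cost <= keep_budget:
            kept.add(u.name)
            used += u.cost
    return kept, used


# Toy model: FFN channels are cheap, KV head groups are expensive.
units = [
    Unit("l0.ffn.c0", 3, 0.9),
    Unit("l0.ffn.c1", 3, 0.2),
    Unit("l0.kv.g0", 8, 0.7),
    Unit("l1.ffn.c0", 3, 0.6),
    Unit("l1.kv.g0", 8, 0.1),
]
total = sum(u.cost for u in units)
# Keep at most half of the parameters, i.e. prune at least 50%.
kept, used = project_to_budget(units, keep_budget=total // 2)
print(sorted(kept), used)
```

Because FFN channels and KV head groups compete in one ranking, the budget can be spent unevenly across layers and unit types, which is the main difference from a fixed per-layer sparsity. Re-running such a projection after every gate update is one way a target sparsity could be held at every step.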

Anthology ID: 2026.acl-long.491
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 10719–10736
URL: https://aclanthology.org/2026.acl-long.491/
Cite (ACL): Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang LI, Rui Mao, and Jianbin Qin. 2026. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10719–10736, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models (Wang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.491.pdf
Checklist: 2026.acl-long.491.checklist.pdf
