WRP: Weight Recover Prune for Structured Sparsity

Zhendong Tan, Xingjun Zhang, Zheng Wei


Abstract
As the scale of Large Language Models (LLMs) grows, model compression becomes necessary to reduce their substantial demand on computational resources. Network pruning significantly reduces model size by converting weight matrices from a dense to a sparse data format. Current methods advocate one-shot pruning to avoid the expense of retraining, and they preserve model performance at 50%-60% unstructured sparsity. Nevertheless, matrices at this level of sparsity cannot be stored efficiently as sparse matrices, because the storage overhead of the indices would offset the savings. To mitigate this problem, NVIDIA introduced 2:4 structured sparsity. However, we observe a notable decline in model performance when adopting 2:4 structured sparsity due to its group constraints. In this paper, we introduce the Weight Recover Prune (WRP) approach. By recovering a minimal set of critical weights, WRP aims to enhance model performance while maintaining compression efficiency. Our evaluation of WRP on the LLAMA2 and OPT models shows that it outperforms other one-shot pruning methods with the 2:4 pattern, while guaranteeing that the pruned model is about 60% of the size of the dense model. Our code is available at: https://github.com/TanZhendong/WRP.
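For context on the 2:4 pattern the abstract refers to: NVIDIA's 2:4 structured sparsity requires that at most two of every four consecutive weights be nonzero, which lets hardware store only the surviving values plus compact 2-bit position indices. The sketch below illustrates plain magnitude-based 2:4 pruning in PyTorch. It is an illustrative baseline only, not the WRP algorithm from the paper (WRP additionally recovers a small set of critical weights); the function name prune_2_4 is our own.

```python
# Minimal sketch of 2:4 (semi-structured) magnitude pruning.
# NOT the WRP method itself -- just the group constraint it builds on.
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in every group of 4.

    `weight` is a 2-D dense matrix whose row length is divisible by 4.
    Returns the pruned matrix, still in dense format; a real deployment
    would repack it into NVIDIA's 2:4 storage (values + 2-bit indices).
    """
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 sparsity needs row length divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    # Keep the top-2 magnitudes per group of 4, mask out the rest.
    _, keep_idx = groups.abs().topk(2, dim=-1)
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep_idx, True)
    return (groups * mask).reshape(rows, cols)

if __name__ == "__main__":
    w = torch.randn(8, 16)
    w_sparse = prune_2_4(w)
    # Every group of 4 consecutive entries now has exactly 2 nonzeros.
    print((w_sparse.reshape(8, -1, 4) != 0).sum(-1))
```

Because exactly half the weights in every group must be dropped regardless of their importance, large-magnitude weights that happen to fall in the same group of four compete with each other; this group constraint is the source of the performance decline that WRP's weight-recovery step is designed to offset.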
Anthology ID:
2024.acl-long.347
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
6433–6443
URL:
https://aclanthology.org/2024.acl-long.347
Cite (ACL):
Zhendong Tan, Xingjun Zhang, and Zheng Wei. 2024. WRP: Weight Recover Prune for Structured Sparsity. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6433–6443, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
WRP: Weight Recover Prune for Structured Sparsity (Tan et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.347.pdf