Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization

Chong Yu; Tao Chen; Zhongxue Gan

doi:10.18653/v1/2023.findings-acl.15

Abstract

Along with performance improvements in the NLP domain, the sizes of transformer-based language models (TLMs) have also increased dramatically. Some prior works compress TLMs into more compact forms but do not fully consider that hardware characteristics may not support efficient execution of these forms, so deploying TLMs on hardware with noticeable acceleration remains challenging. This paper designs a compression scheme named GPUSQ-TLM to make maximal use of GPU-friendly 2:4 fine-grained structured sparsity and quantization characteristics. Specifically, a dense TLM is first pruned to meet the GPU’s acceleration constraint on sparse patterns with the FP16 type, and is then further quantized into a fixed-point model by quantization-aware training, which provides an extra speedup for integer tensors on GPU. A mixed-strategy knowledge distillation of labels, logits and feature maps is used to best compensate for accuracy loss during the pruning and quantization process. Experimental results show that the GPUSQ-TLM scheme achieves state-of-the-art compression on TLMs with various encoder and decoder blocks, with negligible accuracy degradation on the SQuAD, GLUE, CNN-DM & XSum and WikiText benchmarking tasks. Moreover, GPUSQ-TLM can boost actual deployment performance by up to 4.08-4.25x in latency and 6.18-6.79x in throughput on an A100 GPU.
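
To make the 2:4 pattern and the fixed-point step concrete, here is a minimal pure-Python sketch. It is not the GPUSQ-TLM implementation. It assumes simple magnitude-based pruning (keep the two largest-magnitude values in every group of four) followed by symmetric per-tensor INT8 rounding. The quantization-aware training and knowledge distillation described in the abstract are not modeled. The names `prune_2_4` and `quantize_int8` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def prune_2_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four.

    Assumption: plain magnitude pruning, used here only to show the
    2:4 pattern (at most two non-zeros per four consecutive weights).
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization; returns (integers, scale).

    Assumption: one scale for the whole tensor, round-to-nearest, and
    no training loop (so this is not quantization-aware training).
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    ints = [max(-127, min(127, round(v / scale))) for v in values]
    return ints, scale


# Toy weight row with two groups of four values.
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_4(weights)       # 2:4 sparse, still floating point
q, scale = quantize_int8(sparse)  # fixed-point integers plus one scale
dequant = [x * scale for x in q]  # approximate reconstruction
```

In this sketch the zeros introduced by pruning stay zero after quantization, so the 2:4 pattern survives the fixed-point conversion, which is the property the two-stage ordering in the abstract relies on.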

Anthology ID: 2023.findings-acl.15
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 218–235
URL: https://aclanthology.org/2023.findings-acl.15/
DOI: 10.18653/v1/2023.findings-acl.15
Cite (ACL): Chong Yu, Tao Chen, and Zhongxue Gan. 2023. Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization. In Findings of the Association for Computational Linguistics: ACL 2023, pages 218–235, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization (Yu et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.15.pdf