Efficient Two-Stage Progressive Quantization of BERT

Charles Le, Arash Ardakani, Amir Ardakani, Hang Zhang, Yuyan Chen, James Clark, Brett Meyer, Warren Gross


Abstract
The success of large BERT models has raised the demand for model compression methods that reduce model size and computational cost. Quantization reduces model size and inference latency, making inference more efficient without changing the model's structure, but it comes at the cost of performance degradation. Due to the complex loss landscape of ternarized/binarized BERT, we present an efficient two-stage progressive quantization method: first, we fine-tune the model with quantized weights, progressively lowering their bitwidth; then, we fine-tune the model with both quantized weights and activations. At the same time, we strategically choose which bitwidth to fine-tune on and to initialize from, and which bitwidth to fine-tune under augmented data. This allows our method to outperform existing BERT binarization methods without adding an extra module, compressing the binary model 18% more than previous binarization methods, or compressing BERT by 31x relative to the full-precision model. Even without data augmentation, our method outperforms existing BERT ternarization methods.
Anthology ID:
2022.sustainlp-1.2
Volume:
Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Angela Fan, Iryna Gurevych, Yufang Hou, Zornitsa Kozareva, Sasha Luccioni, Nafise Sadat Moosavi, Sujith Ravi, Gyuwan Kim, Roy Schwartz, Andreas Rücklé
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
1–9
URL:
https://aclanthology.org/2022.sustainlp-1.2
DOI:
10.18653/v1/2022.sustainlp-1.2
Cite (ACL):
Charles Le, Arash Ardakani, Amir Ardakani, Hang Zhang, Yuyan Chen, James Clark, Brett Meyer, and Warren Gross. 2022. Efficient Two-Stage Progressive Quantization of BERT. In Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 1–9, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Efficient Two-Stage Progressive Quantization of BERT (Le et al., sustainlp 2022)
PDF:
https://aclanthology.org/2022.sustainlp-1.2.pdf
Video:
https://aclanthology.org/2022.sustainlp-1.2.mp4