Zero-Shot Dynamic Quantization for Transformer Inference

Yousef El-kurdi, Jerry Quinn, Avi Sil


Abstract
We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing quantization methods either modify the training procedure or require an additional calibration step to adjust parameters, which in turn depends on a selected held-out dataset. Our method delivers the benefits of quantization without either of these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.
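For context, the starting point the paper improves on is standard post-training dynamic int8 quantization of a Transformer's linear layers. The sketch below is not the paper's zero-shot method; it is a minimal illustration of off-the-shelf dynamic quantization in PyTorch, assuming a BERT-like model loaded through Hugging Face transformers (the model name and input sentence are placeholders).

```python
# Minimal sketch: generic post-training dynamic int8 quantization in PyTorch.
# This is the conventional baseline, NOT the paper's zero-shot technique;
# model name and example input are placeholders for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder BERT-like model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Swap nn.Linear modules for dynamically quantized int8 versions:
# weights are quantized ahead of time, activations at run time,
# so no calibration dataset is needed for this baseline either.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer(
    "Dynamic quantization needs no calibration set.", return_tensors="pt"
)
with torch.no_grad():
    outputs = quantized(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The paper's contribution addresses the accuracy loss that such straightforward int8 quantization typically incurs on BERT-like models, without retraining or calibration.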
Anthology ID: 2022.emnlp-industry.45
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month: December
Year: 2022
Address: Abu Dhabi, UAE
Editors: Yunyao Li, Angeliki Lazaridou
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 451–457
URL: https://aclanthology.org/2022.emnlp-industry.45
DOI: 10.18653/v1/2022.emnlp-industry.45
Cite (ACL): Yousef El-kurdi, Jerry Quinn, and Avi Sil. 2022. Zero-Shot Dynamic Quantization for Transformer Inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 451–457, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Zero-Shot Dynamic Quantization for Transformer Inference (El-kurdi et al., EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-industry.45.pdf