Dongsoo Lee
2022
AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models
Se Jung Kwon | Jeonghoon Kim | Jeongin Bae | Kang Min Yoo | Jin-Hwa Kim | Baeseong Park | Byeongwook Kim | Jung-Woo Ha | Nako Sung | Dongsoo Lee
Findings of the Association for Computational Linguistics: EMNLP 2022
There is growing interest in adapting large-scale language models using parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression has not been thoroughly explored yet. Model compression could provide the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation and model compression, we propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of the quantized parameters for a target task. Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
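The factorization described in this abstract can be illustrated with a minimal NumPy sketch. It assumes a greedy, row-wise binary-coding scheme (sign of the residual as the binary code, mean absolute residual as the scale); the abstract does not specify the paper's exact procedure, and the function names `bcq_quantize`, `bcq_reconstruct`, and `alpha_grad` are hypothetical, chosen only for illustration.

```python
import numpy as np

def bcq_quantize(weight, num_bits):
    """Greedy binary-coding quantization (an assumed variant, not the paper's code).

    Approximates weight[r, c] by sum_b alpha[r, b] * binary[b, r, c],
    where every entry of binary is -1 or +1.
    """
    residual = np.asarray(weight, dtype=np.float64).copy()
    alphas, binaries = [], []
    for _ in range(num_bits):
        b = np.where(residual >= 0, 1.0, -1.0)   # binary code: sign of the residual
        a = np.abs(residual).mean(axis=1)        # least-squares scale per row for that code
        alphas.append(a)
        binaries.append(b)
        residual -= a[:, None] * b               # quantize what is left in the next round
    return np.stack(alphas, axis=1), np.stack(binaries, axis=0)

def bcq_reconstruct(alpha, binary):
    """alpha: (rows, bits); binary: (bits, rows, cols) -> dense (rows, cols) matrix."""
    return np.einsum("rb,brc->rc", alpha, binary)

def alpha_grad(grad_weight, binary):
    """Gradient w.r.t. the scaling factors only; the binary codes stay frozen.

    The reconstruction is linear in alpha, so the chain rule reduces to one contraction.
    """
    return np.einsum("rc,brc->rb", grad_weight, binary)
```

In this sketch, adaptation would update only `alpha` (one scalar per row and bit) via `alpha_grad`, while `binary` is shared unchanged across tasks, which mirrors the frozen-binary, tuned-scale split the abstract describes.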
2020
Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation
Insoo Chung | Byeongwook Kim | Yoonjung Choi | Se Jung Kwon | Yongkweon Jeon | Baeseong Park | Sangha Kim | Dongsoo Lee
Findings of the Association for Computational Linguistics: EMNLP 2020
The deployment of the widely used Transformer architecture is challenging because of heavy computation load and memory overhead during inference, especially when the target device is limited in computational resources, such as mobile or edge devices. Quantization is an effective technique to address such challenges. Our analysis shows that for a given number of quantization bits, each block of the Transformer contributes to translation quality and inference computations in different manners. Moreover, even inside an embedding block, each word presents vastly different contributions. Correspondingly, we propose a mixed precision quantization strategy to represent Transformer weights by an extremely low number of bits (e.g., under 3 bits). For example, for each word in an embedding block, we assign different quantization bits based on its statistical properties. Our quantized Transformer model achieves 11.8× smaller model size than the baseline model, with a BLEU degradation of less than 0.5. We achieve 8.3× reduction in run-time memory footprints and 3.5× speedup (Galaxy N10+) such that our proposed compression strategy enables efficient implementation for on-device NMT.
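The per-word bit assignment mentioned in this abstract can be sketched as follows. The abstract does not name the statistic it uses, so this example assumes word frequency; the rank cutoffs, bit-widths, and the function name `assign_bits_by_rank` are all hypothetical and purely illustrative.

```python
import numpy as np

def assign_bits_by_rank(counts, boundaries, bits):
    """Give each word a bit-width according to its frequency rank (an assumed criterion).

    counts:     per-word occurrence counts
    boundaries: ascending rank cutoffs separating the tiers
    bits:       one bit-width per tier, most frequent tier first
                (len(bits) == len(boundaries) + 1)
    """
    counts = np.asarray(counts)
    order = np.argsort(-counts, kind="stable")   # word indices, most frequent first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(counts))        # rank 0 = most frequent word
    tier = np.searchsorted(boundaries, ranks, side="right")
    return np.asarray(bits)[tier]

# Toy vocabulary of six words: the top-ranked word gets 3 bits,
# the next two get 2 bits, and the rest get 1 bit.
word_bits = assign_bits_by_rank([100, 5, 50, 1, 20, 2], boundaries=[1, 3], bits=[3, 2, 1])
average_bits = word_bits.mean()                  # average bits per embedding row
```

Giving rarely used words fewer bits is one way the average bit-width of an embedding table could fall below 3 while frequent words keep more precision.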
Co-authors
- Byeongwook Kim 2
- Se Jung Kwon 2
- Baeseong Park 2
- Jeongin Bae 1
- Yoonjung Choi 1