DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

Wenjing Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum


Abstract
Improving the inference efficiency of Large Language Models (LLMs) is a critical area of research. Post-Training Quantization (PTQ) is a popular technique, but it often struggles at low-bit levels, particularly on downstream tasks. Quantization-Aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which retains the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight magnitude and direction in the quantization space. We validate the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, our approach outperforms the previous state-of-the-art method by 4.2% on MMLU for 3-bit LLaMA-7B. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
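
A minimal sketch of the general idea described above: a frozen pretrained weight is updated through low-rank (LoRA-style) matrices, renormalized per quantization group with a trainable group-wise magnitude, and passed through group-wise fake quantization with a straight-through estimator. The class and function names, the group size, and the exact normalization below are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

def fake_quantize(w, n_bits=3, group_size=128):
    # Symmetric group-wise fake quantization with a straight-through estimator (assumed setup).
    shape = w.shape
    w = w.reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized weight; gradients flow straight through to w.
    return (w + (w_q - w).detach()).reshape(shape)

class LowRankQATLinear(nn.Module):
    # Frozen base weight + trainable low-rank update and group-wise magnitude (illustrative).
    def __init__(self, base: nn.Linear, rank=16, n_bits=3, group_size=128):
        super().__init__()
        self.register_buffer("w0", base.weight.detach().clone())  # frozen pretrained weight
        out_f, in_f = base.weight.shape
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))
        # One trainable magnitude per quantization group (hypothetical parameterization).
        self.magnitude = nn.Parameter(torch.ones(out_f * in_f // group_size, 1))
        self.n_bits, self.group_size = n_bits, group_size

    def forward(self, x):
        w = self.w0 + self.lora_b @ self.lora_a                  # low-rank direction update
        w_g = w.reshape(-1, self.group_size)
        w_g = self.magnitude * w_g / w_g.norm(dim=1, keepdim=True).clamp(min=1e-8)
        w_q = fake_quantize(w_g.reshape(w.shape), self.n_bits, self.group_size)
        return nn.functional.linear(x, w_q)

In such a setup only lora_a, lora_b, and magnitude receive gradients, which is how the trainable-parameter count stays far below 1% of the model.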
Anthology ID:
2024.emnlp-industry.10
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
113–119
URL:
https://aclanthology.org/2024.emnlp-industry.10
Cite (ACL):
Wenjing Ke, Zhe Li, Dong Li, Lu Tian, and Emad Barsoum. 2024. DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 113–119, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models (Ke et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-industry.10.pdf
Poster:
2024.emnlp-industry.10.poster.pdf