Yu Mao
2024
When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
Weilan Wang | Yu Mao | Tang Dongdong | Du Hongchao | Nan Guan | Chun Jason Xue
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) exhibit excellent performance on various tasks. However, the memory requirements of LLMs present a great challenge when deploying them on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving about a 2.2x compression ratio. A compression-aware quantization is first proposed to enhance the compressibility of model weights by re-scaling the model parameters before quantization, followed by a pruning method that improves compressibility further. We then observe that decompression can become a bottleneck in practical deployment scenarios and give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method; a speed-adaptive method is proposed to overcome it. Experimental results show that inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
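The abstract's core idea is that quantized weights can be shrunk further by a lossless compressor, and that the quantization step can be tuned to make the weights more compressible. The sketch below is not the paper's implementation; it is a minimal illustration, under assumed names and scales (`quantize_int4`, `compression_ratio`, the chosen scale values), of how the lossless compression ratio of a quantized tensor can be measured and how the quantization scale influences it.

```python
# Illustrative sketch only (not the authors' method): measure how well a
# lossless compressor (zlib here, as an assumption) shrinks quantized weights,
# the quantity that compression-aware quantization aims to improve.
import zlib
import numpy as np

def quantize_int4(w: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric 4-bit quantization of a weight tensor (hypothetical helper)."""
    return np.clip(np.round(w / scale), -8, 7).astype(np.int8)

def compression_ratio(q: np.ndarray) -> float:
    """Ratio of raw quantized size to zlib-compressed size."""
    raw = q.tobytes()
    return len(raw) / len(zlib.compress(raw, level=9))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 256)).astype(np.float32)

# A larger scale concentrates values onto fewer codes, which typically makes
# the quantized tensor more compressible (at some cost in accuracy).
for scale in (0.005, 0.01, 0.02):
    q = quantize_int4(w, scale)
    print(f"scale={scale:.3f}  compression ratio={compression_ratio(q):.2f}x")
```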
2015
The Discovery of Natural Typing Annotations: User-produced Potential Chinese Word Delimiters
Dakui Zhang | Yu Mao | Yang Liu | Hanshi Wang | Chuyuan Wei | Shiping Tang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)