Ngai Wong


pdf bib
LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models
Yifan Yang | Jiajun Zhou | Ngai Wong | Zheng Zhang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Various parameter-efficient fine-tuning (PEFT) techniques have been proposed to enable computationally efficient fine-tuning while maintaining model performance. However, existing PEFT methods are still limited by the growing number of trainable parameters with the rapid deployment of Large Language Models (LLMs). To address this challenge, we present LoRETTA, an ultra-parameter-efficient framework that significantly reduces trainable parameters through tensor-train decomposition. Specifically, we propose two methods, named LoRETTA_adp and LoRETTA_rep. The former employs tensorized adapters, offering a high-performance yet lightweight approach for the fine-tuning of LLMs. The latter emphasizes fine-tuning via weight reparameterization with a set of small tensor factors. LoRETTA achieves comparable or better performance than most widely used PEFT methods with up to 100× fewer parameters on the LLaMA-2-7B models. Furthermore, empirical results demonstrate that the proposed methods exhibit remarkable anti-overfitting capability, effectively improve training efficiency, and enjoy better multi-task learning performance. Plug-and-play loretta library built upon the Huggingface framework and PEFT library are provided.

pdf bib
Weight-Inherited Distillation for Task-Agnostic BERT Compression
Taiqiang Wu | Cheng Hou | Shanshan Lao | Jiayi Li | Ngai Wong | Zhe Zhao | Yujiu Yang
Findings of the Association for Computational Linguistics: NAACL 2024

Knowledge Distillation (KD) is a predominant approach for BERT compression.Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model.These methods transfer the knowledge in an indirect way.In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher.WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation.Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization.Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines.Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions.The code is available at


pdf bib
Structured Pruning for Efficient Generative Pre-trained Language Models
Chaofan Tao | Lu Hou | Haoli Bai | Jiansheng Wei | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Findings of the Association for Computational Linguistics: ACL 2023

The increasing sizes of large generative Pre-trained Language Models (PLMs) hinder their deploymentin real-world applications. To obtain efficient PLMs, previous studies mostly focus on pruning the attention heads and feed-forward networks (FFNs) of the Transformer. Nevertheless, we find that in generative PLMs, the hidden dimension shared by many other modules (e.g., embedding layer and layer normalization) contains persistent outliers regardless of the network input. This study comprehensively investigates the structured pruning of generative PLMs with all the above compressible components. To identify redundant network structures, we assign learnable masks over compressible components followed by sparse training. Various sizes of PLMs can be flexibly extracted via different thresholds, and are then task-specifically fine-tuned for further improvement. Extensive experiments on language modeling, summarization and machine translation validate the effectiveness of the proposed method. For example, the pruned BART brings 1.51x/6.96x inference speedup on GPU/CPU with 67% size reduction, and can be further combined with quantization for more than 25× compression.

pdf bib
Gradually Excavating External Knowledge for Implicit Complex Question Answering
Chang Liu | Xiaoguang Li | Lifeng Shang | Xin Jiang | Qun Liu | Edmund Lam | Ngai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023

Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution due to the reasons of: 1) uncovered or out-of-date domain knowledge, 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire extrinsic information, then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% parameters of its competitors, setting new SOTA in the ~10B LLM class.


pdf bib
Compression of Generative Pre-trained Language Models via Quantization
Chaofan Tao | Lu Hou | Wei Zhang | Lifeng Shang | Xin Jiang | Qun Liu | Ping Luo | Ngai Wong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The increasing size of generative Pre-trained Language Models (PLMs) have greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and the underlying difficulty remains unclear. In this paper, we compress generative PLMs by quantization. We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings caused by reduced capacity and the varied distribution of weights. Correspondingly, we propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules. Empirical results on various tasks show that our proposed method outperforms the state-of-the-art compression methods on generative PLMs by a clear margin. With comparable performance with the full-precision models, we achieve 14.4x and 13.4x compression rate on GPT-2 and BART, respectively.