Yefan Zhou
2024
Model Balancing Helps Low-data Training and Fine-tuning
Zihang Liu | Yuanzhe Hu | Tianyu Pang | Yefan Zhou | Pu Ren | Yaoqing Yang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advances in foundation models have emphasized the need to align pre-trained models with specialized domains using small, curated datasets. Studies on these foundation models underscore the importance of low-data training and fine-tuning. This topic, well-known in natural language processing (NLP), has also gained increasing attention in the emerging field of scientific machine learning (SciML). To address the limitations of low-data training and fine-tuning, we draw inspiration from Heavy-Tailed Self-Regularization (HT-SR) theory, analyzing the shape of empirical spectral densities (ESDs) and revealing an imbalance in training quality across different model layers. To mitigate this issue, we adapt a recently proposed layer-wise learning rate scheduler, TempBalance, which effectively balances training quality across layers and enhances low-data training and fine-tuning for both NLP and SciML tasks. Notably, TempBalance demonstrates increasing performance gains as the amount of available tuning data decreases. Comparative analyses further highlight the effectiveness of TempBalance and its adaptability as an “add-on” method for improving model performance.
AlphaLoRA: Assigning LoRA Experts Based on Layer Training Quality
Peijun Qing | Chongyang Gao | Yefan Zhou | Xingjian Diao | Yaoqing Yang | Soroush Vosoughi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Parameter-efficient fine-tuning methods, such as Low-Rank Adaptation (LoRA), are known to enhance training efficiency in Large Language Models (LLMs). Due to the limited parameters of LoRA, recent studies seek to combine LoRA with Mixture-of-Experts (MoE) to boost performance across various tasks. However, inspired by the observed redundancy in traditional MoE structures, prior studies find that LoRA experts within the MoE architecture also exhibit redundancy, suggesting a need to vary the allocation of LoRA experts across different layers. In this paper, we leverage Heavy-Tailed Self-Regularization (HT-SR) Theory to design a fine-grained allocation strategy. Our analysis reveals that the number of experts per layer correlates with layer training quality, which exhibits significant variability across layers. Based on this, we introduce AlphaLoRA, a theoretically principled and training-free method for allocating LoRA experts to reduce redundancy further. Experiments on three models across ten language processing and reasoning benchmarks demonstrate that AlphaLoRA achieves comparable or superior performance over all baselines. Our code is available at https://github.com/morelife2017/alphalora.