Numerical Optimizations for Weighted Low-rank Estimation on Language Models
Ting Hua | Yen-Chang Hsu | Felicity Wang | Qian Lou | Yilin Shen | Hongxia Jin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Singular value decomposition (SVD) is one of the most popular compression methods, approximating a target matrix with smaller matrices. However, standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect the task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, a decomposition method that is aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Further, we designed a metric to predict when SVD may introduce a significant performance drop, for which our method can serve as a rescue strategy. Extensive evaluations demonstrate that our method can perform better than current SOTA methods in compressing Transformer-based language models.
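Because weighted value decomposition lacks the closed-form solution that standard SVD enjoys, the factors must be found by iterative optimization. Below is a minimal NumPy sketch of that idea, minimizing a per-parameter importance-weighted reconstruction error with plain gradient descent; the importance matrix, rank, and step size are illustrative assumptions, not the optimization strategies studied in the paper.

```python
# Sketch of weighted low-rank approximation: minimize sum_ij W_ij * (A - U @ V)_ij^2.
# Unlike standard SVD, this objective has no closed-form solution, so we run
# gradient descent on the factors U and V. All hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 32))             # target weight matrix to compress
W = rng.uniform(0.1, 1.0, size=A.shape)   # per-parameter importance (assumed)
rank = 8

U = rng.normal(scale=0.1, size=(A.shape[0], rank))
V = rng.normal(scale=0.1, size=(rank, A.shape[1]))

lr = 1e-2
for step in range(2000):
    R = W * (U @ V - A)      # elementwise importance-weighted residual
    grad_U = 2.0 * R @ V.T   # d/dU of the weighted squared error
    grad_V = 2.0 * U.T @ R   # d/dV of the weighted squared error
    U -= lr * grad_U
    V -= lr * grad_V

weighted_err = np.sum(W * (U @ V - A) ** 2)
print(f"weighted reconstruction error: {weighted_err:.4f}")
```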
Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding
Ting Hua | Yilin Shen | Changsheng Zhao | Yen-Chang Hsu | Hongxia Jin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Domain classification is a fundamental task in natural language understanding (NLU), which often requires fast accommodation to newly emerging domains. This constraint makes it impossible to retrain all previous domains, even if they are accessible to the new model. Most existing continual learning approaches suffer from low accuracy and performance fluctuation, especially when the distributions of old and new data are significantly different. In fact, the key real-world problem is not the absence of old data, but the inefficiency of retraining the model with the whole old dataset. Is it possible to utilize some old data to yield high accuracy and maintain stable performance, while at the same time not introducing extra hyperparameters? In this paper, we propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments. Specifically, we utilize Fisher information to select exemplars that can “record” key information of the original model. Also, a novel scheme called dynamical weight consolidation is proposed to enable hyperparameter-free learning during the retraining process. Extensive experiments demonstrate that baselines provide fluctuating performance, which makes them unreliable in practice. In contrast, our proposed model significantly and consistently outperforms the best state-of-the-art method by up to 20% in average accuracy, and each of its components contributes effectively to the overall performance.
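As a rough illustration of the exemplar-selection idea, the sketch below scores each old example by its contribution to the empirical Fisher information (the squared gradient norm of the log-likelihood under a toy logistic-regression model) and keeps the top-scoring examples as exemplars. The toy model, the scoring rule, and the exemplar budget are assumptions for illustration, not the paper's exact procedure, and dynamical weight consolidation is not shown.

```python
# Sketch of Fisher-information-based exemplar selection for continual learning,
# using a toy logistic-regression model so the per-example gradients are analytic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))           # old-domain features (toy data)
y = rng.integers(0, 2, size=n)        # old-domain labels (toy data)
w = rng.normal(scale=0.1, size=d)     # parameters of the "original" model

def per_example_fisher_score(x, label, w):
    """Squared gradient norm of the log-likelihood for one example
    (its contribution to the diagonal empirical Fisher information)."""
    p = 1.0 / (1.0 + np.exp(-x @ w))  # predicted probability of class 1
    grad = (p - label) * x            # gradient of the negative log-likelihood
    return np.sum(grad ** 2)

scores = np.array([per_example_fisher_score(X[i], y[i], w) for i in range(n)])

# Keep the examples carrying the most Fisher information as exemplars
# to replay when adapting the model to a new domain.
num_exemplars = 20
exemplar_idx = np.argsort(scores)[-num_exemplars:]
print("selected exemplar indices:", exemplar_idx)
```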