2024
pdf
bib
abs
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
Yifeng Ding
|
Jiawei Liu
|
Yuxiang Wei
|
Lingming Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce XFT, a simple yet powerful training scheme that unleashes the performance limit of instruction-tuned code Large Language Models (LLMs) by simply merging upcycled Mixture-of-Experts (MoE) models. While vanilla sparse upcycling fails to improve instruction tuning, XFT introduces a shared-expert mechanism with a novel routing weight normalization strategy into sparse upcycling, which significantly boosts instruction tuning. After fine-tuning the upcycled MoE model, XFT uses a learnable model merging mechanism to compile the upcycled MoE model back into a dense model, achieving upcycled MoE-level performance with only dense-model compute. By applying XFT to a 1.3B model, we create a new state-of-the-art tiny code LLM with 67.1 and 64.6 pass@1 on HumanEval and HumanEval+, respectively. With the same data and model architecture, XFT improves supervised fine-tuning (SFT) by 13% on HumanEval+, along with consistent improvements from 2% to 13% on MBPP+, MultiPL-E, and DS-1000, demonstrating its generalizability. XFT is fully orthogonal to existing techniques such as Evol-Instruct and OSS-Instruct, opening a new dimension for improving code instruction tuning. Code is available at https://github.com/ise-uiuc/xft.
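The two-stage recipe described in the abstract can be sketched, in heavily simplified form, as follows. This is not the authors' implementation; all module names, the particular normalization, and the merging formula are assumptions for illustration.

```python
# Illustrative sketch of the two XFT stages: (1) upcycle a dense FFN into an
# MoE layer with a shared expert and normalized routing weights, then
# (2) merge the fine-tuned experts back into one dense FFN with learnable
# mixing coefficients. Names and formulas here are assumptions, not the paper's.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class UpcycledMoE(nn.Module):
    """Sparse upcycling: every expert starts as a copy of the dense FFN.
    Expert 0 acts as the always-active shared expert; the rest are routed."""

    def __init__(self, dense: DenseFFN, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(copy.deepcopy(dense) for _ in range(num_experts))
        self.router = nn.Linear(dense.up.in_features, num_experts - 1)
        self.shared_gate = nn.Parameter(torch.zeros(1))  # balances shared vs. routed

    def forward(self, x):
        gate = F.softmax(self.router(x), dim=-1)                        # (..., E-1)
        routed = torch.stack([e(x) for e in self.experts[1:]], dim=-2)  # (..., E-1, d)
        routed_out = (gate.unsqueeze(-1) * routed).sum(dim=-2)
        # Routing weight normalization (illustrative form): the shared and
        # routed contributions receive weights that sum to 1.
        s = torch.sigmoid(self.shared_gate)
        return s * self.experts[0](x) + (1 - s) * routed_out


def merge_to_dense(moe: UpcycledMoE, mix_logits: torch.Tensor) -> DenseFFN:
    """Learnable merging: collapse the experts back into one dense FFN via a
    convex combination of their weights (mix_logits would be learned)."""
    mix = torch.softmax(mix_logits, dim=0)
    merged = copy.deepcopy(moe.experts[0])
    with torch.no_grad():
        for name, p in merged.named_parameters():
            p.copy_(sum(w * dict(e.named_parameters())[name]
                        for w, e in zip(mix, moe.experts)))
    return merged
```

The merged FFN produced by the second stage has the same shape and inference cost as the original dense FFN, which is what lets the approach keep dense-model compute at deployment time.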
2023
pdf
bib
abs
Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation
Zixian Guo
|
Yuxiang Wei
|
Ming Liu
|
Zhilong Ji
|
Jinfeng Bai
|
Yiwen Guo
|
Wangmeng Zuo
Findings of the Association for Computational Linguistics: EMNLP 2023
Parameter-efficient fine-tuning (PEFT) methods have provided an effective way of adapting large vision-language models to specific tasks or scenarios. Typically, they learn a very small number of parameters for pre-trained models in a white-box formulation, which assumes model architectures to be known and parameters to be accessible. However, large models are often not open-source due to considerations of preventing abuse or commercial factors, posing a barrier to the deployment of white-box PEFT methods. To alleviate the dependence on model accessibility, we introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation of black-box models. Specifically, since backpropagation gradients are blocked, we approximate the gradients of textual prompts by analyzing the predictions under perturbed prompts. In addition, a lightweight adapter is deployed over the output features of the inaccessible model, further facilitating the model adaptation process. Empowered with these designs, our CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements over existing black-box VL adaptation methods. Our code will be made publicly available.
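The two components named in the abstract can be illustrated with the minimal sketch below. The black-box model is stood in by a placeholder scoring function, and the specific zeroth-order estimator and adapter shape are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch: (1) approximate prompt gradients from a black-box
# model's predictions under perturbed prompts, and (2) adapt the observable
# output features with a lightweight residual adapter. `black_box_score`
# and all hyperparameters below are placeholders.
import torch
import torch.nn as nn


def black_box_score(prompt_emb: torch.Tensor) -> torch.Tensor:
    """Stand-in for the inaccessible model: returns a scalar task loss."""
    target = torch.ones_like(prompt_emb)
    return ((prompt_emb - target) ** 2).mean()


def estimate_prompt_grad(prompt_emb: torch.Tensor, n_samples: int = 16,
                         sigma: float = 1e-2) -> torch.Tensor:
    """Approximate d(loss)/d(prompt) by querying the model with randomly
    perturbed prompts (two-sided random-perturbation estimator)."""
    grad = torch.zeros_like(prompt_emb)
    for _ in range(n_samples):
        u = torch.randn_like(prompt_emb)
        delta = black_box_score(prompt_emb + sigma * u) - black_box_score(prompt_emb - sigma * u)
        grad += (delta / (2 * sigma)) * u
    return grad / n_samples


class OutputAdapter(nn.Module):
    """Lightweight residual adapter on the black-box model's output features,
    trainable with ordinary backpropagation since those features are observable."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, feat):
        return feat + self.up(torch.relu(self.down(feat)))


# Usage sketch: estimated-gradient steps on the prompt; the adapter would be
# trained jointly on the returned output features.
prompt = torch.zeros(8, 512)
adapter = OutputAdapter(dim=512)
for _ in range(100):
    prompt -= 0.1 * estimate_prompt_grad(prompt)
```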
2022
pdf
bib
abs
Entropy as a measurement of cognitive load in translation
Yuxiang Wei
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Workshop 1: Empirical Translation Process Research)
In view of the “predictive turn” in translation studies, empirical investigations of the translation process have shown increasing interest in studying features of the text which can predict translation efficiency and effort, especially using large-scale experimental data and rigorous statistical means. In this regard, a novel metric based on entropy (i.e., HTra) has been proposed and experimentally studied as a predictor variable. On the one hand, empirical studies show that HTra as a product-based metric can predict effort; on the other, some conceptual analyses have provided theoretical justifications of entropy or entropy reduction as a description of translation from a process perspective. This paper continues the investigation of entropy, conceptually examining two ways of quantifying cognitive load, namely, shift of resource allocation and reduction of entropy, and argues that the former is represented by surprisal and ITra while the latter is represented by HTra. Both can be approximated via corpus-based means and used as potential predictors of effort. Empirical analyses were also conducted comparing the two metrics (i.e., HTra and ITra) in terms of their prediction of effort, showing that ITra is a stronger predictor of TT production time while HTra is a stronger predictor of ST reading time. It is hoped that this will contribute to the exploration of dependable, theoretically justifiable means of predicting the effort involved in translation.
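For readers unfamiliar with the two metrics contrasted in the abstract, the sketch below shows one common way they are computed from a distribution of observed translation choices for a single source word. The distribution, the example word, and the chosen translation are hypothetical; the code is only a simplified illustration, not the paper's procedure.

```python
# Illustrative computation of the two corpus-based metrics: HTra is the
# entropy of the distribution of alternative translations of a source word;
# ITra is the surprisal (-log p) of the translation actually produced.
import math

def htra(probs: dict) -> float:
    """Word translation entropy over alternative translations (in bits)."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def itra(probs: dict, chosen: str) -> float:
    """Surprisal of the translation the translator actually chose (in bits)."""
    return -math.log2(probs[chosen])

# Hypothetical example: one source word rendered differently across translators,
# with relative frequencies normalized to probabilities.
probs = {"approach": 0.6, "method": 0.3, "concept": 0.1}
print(htra(probs))            # ~1.30 bits: spread of observed translation choices
print(itra(probs, "method"))  # ~1.74 bits: effort proxy for producing "method"
```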
2019
pdf
bib
Predicting Cognitive Effort in Translation Production
Yuxiang Wei
Proceedings of the Second MEMENTO workshop on Modelling Parameters of Cognitive Effort in Translation Production