CodeM: Less Data Yields More Versatility via Ability Matrix

Daoguang Zan, Ailun Yu, Wei Liu, Bo Shen, Shaoxin Lin, Yongshun Gong, Yafen Yao, Yan Liu, Bei Guan, Weihua Luo, Yongji Wang, Qianxiang Wang, Lizhen Cui


Abstract
In the era of code large language models (code LLMs), data engineering plays a pivotal role during the instruction fine-tuning phase. To train a versatile model, previous work devotes tremendous effort to crafting instruction data that covers all downstream scenarios. Nonetheless, this incurs significant expense in both constructing the data and training the model. Therefore, this paper introduces CodeM, a novel data construction strategy that can efficiently train a versatile model with less data via our newly proposed ability matrix. CodeM uses the ability matrix to decouple code LLMs’ abilities into two dimensions, constructing a lightweight training corpus that covers only a subset of target scenarios. Extensive experiments on HumanEvalPack and MultiPL-E imply that code LLMs can combine single-dimensional abilities to master composed abilities, validating the effectiveness of CodeM.
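To make the subset-coverage idea concrete, here is a minimal sketch of an "ability matrix" selection step. It assumes, purely for illustration, that the two dimensions are programming language and task type, and that covering one anchor row plus one anchor column spans both dimensions; the specific languages, tasks, and selection rule are not taken from the paper.

```python
# Illustrative sketch (assumptions, not the paper's method): pick a lightweight
# subset of (language, task) cells instead of the full Cartesian product.
from itertools import product

languages = ["Python", "Java", "Go", "Rust"]   # dimension 1 (assumed)
tasks = ["generate", "repair", "explain"]      # dimension 2 (assumed)

# Full coverage: every cell of the ability matrix needs its own instruction data.
full_coverage = list(product(languages, tasks))

# Lightweight coverage: one anchor language (full row) plus one anchor task
# (full column), so every language and every task still appears at least once.
anchor_language, anchor_task = "Python", "generate"
light_coverage = sorted(
    {(anchor_language, t) for t in tasks} | {(l, anchor_task) for l in languages}
)

print(f"full cells: {len(full_coverage)}, lightweight cells: {len(light_coverage)}")
# full cells: 12, lightweight cells: 6
```

Under this toy setup, the uncovered cells (e.g., repairing Java code) are exactly the composed abilities that the paper's experiments suggest a model can acquire by combining the single-dimensional abilities it was trained on.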
Anthology ID: 2024.findings-acl.40
Volume: Findings of the Association for Computational Linguistics ACL 2024
Month: August
Year: 2024
Address: Bangkok, Thailand and virtual meeting
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 714–729
URL: https://aclanthology.org/2024.findings-acl.40
Cite (ACL):
Daoguang Zan, Ailun Yu, Wei Liu, Bo Shen, Shaoxin Lin, Yongshun Gong, Yafen Yao, Yan Liu, Bei Guan, Weihua Luo, Yongji Wang, Qianxiang Wang, and Lizhen Cui. 2024. CodeM: Less Data Yields More Versatility via Ability Matrix. In Findings of the Association for Computational Linguistics ACL 2024, pages 714–729, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
CodeM: Less Data Yields More Versatility via Ability Matrix (Zan et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-acl.40.pdf