Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Yuan Ge; Yilun Liu; Chi Hu; Weibin Meng; Shimin Tao; Xiaofeng Zhao; Mahong Xia; Zhang Li; Boxing Chen; Hao Yang; Bei Li; Tong Xiao (肖桐); Jingbo Zhu

doi:10.18653/v1/2024.emnlp-main.28

Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Mahong Xia, Zhang Li, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, JingBo Zhu

Abstract

With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step involves ranking instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25%). The second step involves preserving dataset diversity through a clustering process. In our experiment, CaR selected a subset containing only 1.96% of Alpaca’s IT data, yet the underlying AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. Furthermore, our method utilizes small models (550M parameters) and requires only 11.2% of the monetary cost compared to existing methods, making it easily deployable in industrial scenarios.

Anthology ID:: 2024.emnlp-main.28
Volume:: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:: November
Year:: 2024
Address:: Miami, Florida, USA
Editors:: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 464–478
Language:
URL:: https://aclanthology.org/2024.emnlp-main.28/
DOI:: 10.18653/v1/2024.emnlp-main.28
Bibkey:
Cite (ACL):: Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Mahong Xia, Zhang Li, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, and JingBo Zhu. 2024. Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 464–478, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):: Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation (Ge et al., EMNLP 2024)
Copy Citation:
PDF:: https://aclanthology.org/2024.emnlp-main.28.pdf

PDF Cite Search Fix data