Instruction Pre-Training: Language Models are Supervised Multitask Learners

Daixuan Cheng; Yuxian Gu; Shaohan Huang; Junyu Bi; Minlie Huang; Furu Wei

doi:10.18653/v1/2024.emnlp-main.148

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, Furu Wei

Abstract

Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-training. In pre-training from scratch, Instruction Pre-training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.

Anthology ID:: 2024.emnlp-main.148
Volume:: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:: November
Year:: 2024
Address:: Miami, Florida, USA
Editors:: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:: EMNLP
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 2529–2550
Language:
URL:: https://aclanthology.org/2024.emnlp-main.148/
DOI:: 10.18653/v1/2024.emnlp-main.148
Bibkey:
Cite (ACL):: Daixuan Cheng, Yuxian Gu, Shaohan Huang, Junyu Bi, Minlie Huang, and Furu Wei. 2024. Instruction Pre-Training: Language Models are Supervised Multitask Learners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2529–2550, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):: Instruction Pre-Training: Language Models are Supervised Multitask Learners (Cheng et al., EMNLP 2024)
Copy Citation:
PDF:: https://aclanthology.org/2024.emnlp-main.148.pdf

PDF Cite Search Fix data