Efficient Continual Pre-training for Building Domain Specific Large Language Models

Yong Xie, Karan Aggarwal, Aitzaz Ahmad


Abstract
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. LLMs tailored to a domain are typically trained entirely on a domain-specific corpus to excel at domain-specific tasks. In this work, we explore an alternative strategy: continual pre-training of an existing open-domain LLM to develop a domain-specific LLM. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work offers a cost-effective alternative for building domain-specific LLMs.
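To make the recipe in the abstract concrete, below is a minimal sketch of domain-adaptive continual pre-training with a data-selection filter, assuming a Hugging Face-style workflow. The model name, the perplexity-based selection heuristic, and the threshold are illustrative assumptions and are not the paper's actual selection strategies, which are described in the full text.

# Minimal sketch, not the paper's exact recipe: continually pre-train an
# open-domain causal LM on a filtered domain corpus. The perplexity-based
# selection heuristic and its threshold are illustrative assumptions.
import math
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "EleutherAI/pythia-6.9b"  # open-domain base model; use a smaller Pythia to experiment
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

def perplexity(text: str) -> float:
    """Score a document with the base model; higher values suggest text more 'novel' to it."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Hypothetical selection step: keep only documents the base model finds novel,
# so continual pre-training sees a small fraction of the raw domain corpus.
domain_docs = ["<financial filings, news, analyst reports, ...>"]
selected = [doc for doc in domain_docs if perplexity(doc) > 20.0]

train_ds = Dataset.from_dict({"text": selected}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finpythia-cpt", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5, bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # one continual pre-training pass over the selected domain subset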
Anthology ID:
2024.findings-acl.606
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10184–10201
URL:
https://aclanthology.org/2024.findings-acl.606
Cite (ACL):
Yong Xie, Karan Aggarwal, and Aitzaz Ahmad. 2024. Efficient Continual Pre-training for Building Domain Specific Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 10184–10201, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Efficient Continual Pre-training for Building Domain Specific Large Language Models (Xie et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.606.pdf