Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, SongYang Gao, Junzhe Wang, Menghan Zhang, Xiang Gao, Yun Wen Chen, Tao Gui


Abstract
Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, which has resulted in significant computational and energy costs. In this paper, we propose Influence Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will have the most positive influence on end-task performance. Furthermore, we design a gradient matching-based influence estimation method, which drastically reduces the time needed to compute influence. With only 0.45% of the data and a three-orders-of-magnitude lower computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
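The selection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a tiny linear model with a squared loss for both the candidate pool and the end task (the paper works with language models and a pretraining objective), and the names `grad_squared_loss`, `mean_gradient`, and `select_influential` are hypothetical. Each candidate is scored by the dot product between its own gradient and the mean end-task gradient, so a positive score means that, to first order, a gradient step on that candidate should also lower the end-task loss.

```python
def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def grad_squared_loss(w, x, y):
    """Gradient of (w . x - y)^2 with respect to w (toy stand-in for a model gradient)."""
    residual = dot(w, x) - y
    return [2.0 * residual * xi for xi in x]


def mean_gradient(w, samples):
    """Average per-sample gradient over a list of (x, y) pairs."""
    grads = [grad_squared_loss(w, x, y) for x, y in samples]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]


def select_influential(w, candidates, task_samples, k):
    """Rank candidates by gradient alignment with the end task and keep the top k.

    Assumption: alignment (dot product) is used as the influence estimate.
    """
    task_grad = mean_gradient(w, task_samples)
    scores = [dot(grad_squared_loss(w, x, y), task_grad) for x, y in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data (made up for illustration only).
w = [0.0, 0.0]
task_samples = [([1.0, 0.0], 1.0), ([1.0, 0.1], 1.0)]
candidates = [
    ([1.0, 0.0], 1.0),   # gradient aligned with the end task
    ([0.0, 1.0], 1.0),   # gradient nearly orthogonal to the end task
    ([1.0, 0.0], -1.0),  # gradient opposed to the end task
]

selected, scores = select_influential(w, candidates, task_samples, k=1)
print(selected, scores)
```

In this toy setting the aligned candidate ranks first and the opposed candidate receives a negative score, which is the kind of sample a selection step of this sort would discard.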
Anthology ID:
2023.findings-acl.35
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
555–568
URL:
https://aclanthology.org/2023.findings-acl.35
DOI:
10.18653/v1/2023.findings-acl.35
Cite (ACL):
Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, SongYang Gao, Junzhe Wang, Menghan Zhang, Xiang Gao, Yun Wen Chen, and Tao Gui. 2023. Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model. In Findings of the Association for Computational Linguistics: ACL 2023, pages 555–568, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model (Wang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.35.pdf
Video:
https://aclanthology.org/2023.findings-acl.35.mp4