Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour


Abstract
We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7–15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
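The sketch below illustrates the general idea of keyword-based masking for masked language modeling; it is not the authors' implementation. It assumes a Hugging Face tokenizer, KeyBERT's extract_keywords for single-word keywords, a hypothetical top_n keyword budget per document, and bert-large-uncased as the PLM; the paper's actual keyword-selection settings and masking budget may differ.

```python
import torch
from keybert import KeyBERT
from transformers import AutoTokenizer


def build_keyword_masked_batch(corpus, top_n=5, model_name="bert-large-uncased"):
    """Tokenize `corpus` and mask the subword tokens of KeyBERT in-domain keywords.

    `top_n` and the per-document keyword extraction are illustrative assumptions,
    not the paper's exact configuration.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    kw_model = KeyBERT()

    batch = tokenizer(list(corpus), truncation=True, padding=True, return_tensors="pt")
    input_ids = batch["input_ids"]
    labels = input_ids.clone()
    mask_positions = torch.zeros_like(input_ids, dtype=torch.bool)

    for i, doc in enumerate(corpus):
        # Single-word keywords that compactly represent this document's domain content.
        keywords = {kw for kw, _ in kw_model.extract_keywords(
            doc, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=top_n)}
        # Collect the subword ids that make up each keyword.
        keyword_ids = {tid for kw in keywords
                       for tid in tokenizer(kw, add_special_tokens=False)["input_ids"]}
        # Mark every position in this sequence that belongs to a keyword.
        mask_positions[i] = torch.tensor(
            [tid.item() in keyword_ids for tid in input_ids[i]], dtype=torch.bool)

    # Replace keyword tokens with [MASK]; compute MLM loss only on those positions.
    input_ids[mask_positions] = tokenizer.mask_token_id
    labels[~mask_positions] = -100
    batch["input_ids"] = input_ids
    return batch, labels
```

The returned batch and labels could then be fed to a masked-LM head (e.g., BertForMaskedLM with labels=labels) for the in-domain pre-training pass that precedes standard fine-tuning.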
Anthology ID:
2023.repl4nlp-1.2
Volume:
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Burcu Can, Maximilian Mozes, Samuel Cahyawijaya, Naomi Saphra, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Chen Zhao, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Lena Voita
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
13–21
URL:
https://aclanthology.org/2023.repl4nlp-1.2
DOI:
10.18653/v1/2023.repl4nlp-1.2
Cite (ACL):
Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, and Ata Kiapour. 2023. Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords. In Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), pages 13–21, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords (Golchin et al., RepL4NLP 2023)
PDF:
https://aclanthology.org/2023.repl4nlp-1.2.pdf