exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources

Wen Tai, H. T. Kung, Xin Dong, Marcus Comiter, Chang-Fu Kuo


Abstract
We introduce exBERT, a training method that extends a BERT model pre-trained on a general domain into a new pre-trained model for a specific domain, with a new additive vocabulary, under constrained training resources (i.e., constrained computation and data). exBERT uses a small extension module to learn to adapt an augmenting embedding for the new domain in the context of the original BERT’s embedding of a general vocabulary. The exBERT training method is novel in learning the new vocabulary and the extension module while keeping the weights of the original BERT model fixed, resulting in a substantial reduction in required training resources. We pre-train exBERT on biomedical articles from ClinicalKey and PubMed Central, and study its performance on biomedical downstream benchmark tasks using the MTL-Bioinformatics-2016 datasets. We demonstrate that exBERT consistently outperforms prior approaches when the corpus and pre-training computation resources are limited.
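The split between fixed original weights and a learned additive vocabulary can be illustrated with a minimal sketch. It is not the authors' implementation and does not model the extension module; it only shows one way an embedding lookup might route tokens to a frozen general-vocabulary table or to a trainable table for new domain tokens. The class name `ExtendedEmbedding`, its methods, the example tokens, and the assumption that base token ids are contiguous from 0 are all hypothetical.

```python
import random


class ExtendedEmbedding:
    """Hypothetical lookup: frozen general vocabulary plus trainable additive vocabulary."""

    def __init__(self, base_vocab, extra_tokens, dim, seed=0):
        rng = random.Random(seed)
        # Assumption: base_vocab maps token -> id with ids 0..len(base_vocab)-1.
        self.vocab = dict(base_vocab)
        self.base_size = len(base_vocab)
        # New domain tokens get ids appended after the original vocabulary.
        for tok in extra_tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
        n_extra = len(self.vocab) - self.base_size
        # Stand-in for the pre-trained embedding table (random here, never updated).
        self.base_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                           for _ in range(self.base_size)]
        # Augmenting embedding table for the additive vocabulary (trainable).
        self.ext_table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                          for _ in range(n_extra)]

    def lookup(self, token):
        idx = self.vocab[token]
        if idx < self.base_size:
            return self.base_table[idx]
        return self.ext_table[idx - self.base_size]

    def sgd_step(self, token, grad, lr=0.1):
        """Apply a gradient step only to extension rows; return True if updated."""
        idx = self.vocab[token]
        if idx < self.base_size:
            return False  # original weights stay fixed
        row = self.ext_table[idx - self.base_size]
        for i, g in enumerate(grad):
            row[i] -= lr * g
        return True


emb = ExtendedEmbedding({"the": 0, "patient": 1}, ["thrombocytopenia"], dim=4)
print(emb.vocab["thrombocytopenia"])        # id appended after the base vocabulary
print(emb.sgd_step("the", [1.0] * 4))       # base token: no update
print(emb.sgd_step("thrombocytopenia", [1.0] * 4))  # new token: updated
```

In this sketch only the rows of `ext_table` ever change, which is the sense in which the training cost depends on the size of the added vocabulary rather than on the full original table.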
Anthology ID: 2020.findings-emnlp.129
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Venues: EMNLP | Findings
Publisher: Association for Computational Linguistics
Pages: 1433–1439
URL: https://aclanthology.org/2020.findings-emnlp.129
DOI: 10.18653/v1/2020.findings-emnlp.129
PDF: https://aclanthology.org/2020.findings-emnlp.129.pdf