Cross-Domain Language Modeling: An Empirical Investigation

Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, Zhenchang Xing


Abstract
Transformer encoder models exhibit strong performance in single-domain applications. In a cross-domain setting, however, a shared sub-word vocabulary produces sub-word overlap, which becomes problematic when the overlapping sub-words carry no semantic similarity across domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model while pretraining on multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain setting as a result of the reduced sub-word overlap.
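To make the notion of sub-word overlap concrete, the following is a minimal sketch (not the authors' code) that estimates how much two domain-specific sub-word vocabularies share. The checkpoint names bert-base-uncased and allenai/scibert_scivocab_uncased are illustrative stand-ins for a general-domain and a biomedical-domain tokenizer; any pair of domain-specific vocabularies could be compared the same way.

```python
# Sketch: estimate sub-word vocabulary overlap between two domains.
# Checkpoint names below are illustrative, not the paper's actual models.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomedical = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# get_vocab() maps sub-word strings to ids; we only need the string types.
general_vocab = set(general.get_vocab())
biomedical_vocab = set(biomedical.get_vocab())

shared = general_vocab & biomedical_vocab
jaccard = len(shared) / len(general_vocab | biomedical_vocab)

print(f"shared sub-words: {len(shared)}")
print(f"Jaccard overlap:  {jaccard:.3f}")
```

A high overlap count only indicates shared surface forms; as the abstract notes, the problematic cases are overlapping sub-words whose meanings differ between the general and biomedical domains.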
Anthology ID:
2021.alta-1.22
Volume:
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Month:
December
Year:
2021
Address:
Online
Editors:
Afshin Rahimi, William Lane, Guido Zuccon
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
192–200
URL:
https://aclanthology.org/2021.alta-1.22
Cite (ACL):
Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, and Zhenchang Xing. 2021. Cross-Domain Language Modeling: An Empirical Investigation. In Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association, pages 192–200, Online. Australasian Language Technology Association.
Cite (Informal):
Cross-Domain Language Modeling: An Empirical Investigation (Nguyen et al., ALTA 2021)
PDF:
https://aclanthology.org/2021.alta-1.22.pdf
Data
BLUE, GLUE, QNLI