Cross-Domain Language Modeling: An Empirical Investigation

Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, Zhenchang Xing


Abstract
Transformer encoder models exhibit strong performance in single-domain applications. In a cross-domain setting, however, a shared sub-word vocabulary produces sub-word overlap, which becomes problematic when the overlapping sub-words carry no semantic similarity across domains. We hypothesize that alleviating this overlap allows for more effective modeling of multi-domain tasks; we consider the biomedical and general domains in this paper. We present a study on reducing sub-word overlap by scaling the vocabulary size of a Transformer encoder model while pretraining on multiple domains. We observe a significant increase in downstream performance in the general-biomedical cross-domain setting as a result of the reduced sub-word overlap.
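To make the notion of sub-word overlap concrete, the following is a minimal sketch (not the authors' code) that estimates how much two domain-specific sub-word vocabularies share. The checkpoint names bert-base-uncased and allenai/scibert_scivocab_uncased are illustrative stand-ins for a general-domain and a biomedical-domain tokenizer; any pair of domain-specific vocabularies could be compared the same way.

```python
# Sketch: estimate sub-word vocabulary overlap between two domains.
# Checkpoint names below are illustrative, not the paper's actual models.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomedical = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# get_vocab() maps sub-word strings to ids; we only need the string types.
general_vocab = set(general.get_vocab())
biomedical_vocab = set(biomedical.get_vocab())

shared = general_vocab & biomedical_vocab
jaccard = len(shared) / len(general_vocab | biomedical_vocab)

print(f"shared sub-words: {len(shared)}")
print(f"Jaccard overlap:  {jaccard:.3f}")
```

A high overlap count only indicates shared surface forms; as the abstract notes, the problematic cases are overlapping sub-words whose meanings differ between the general and biomedical domains.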
Anthology ID:
2021.alta-1.22
Volume:
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association
Month:
December
Year:
2021
Address:
Online
Editors:
Afshin Rahimi, William Lane, Guido Zuccon
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
192–200
URL:
https://aclanthology.org/2021.alta-1.22
Cite (ACL):
Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, and Zhenchang Xing. 2021. Cross-Domain Language Modeling: An Empirical Investigation. In Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association, pages 192–200, Online. Australasian Language Technology Association.
Cite (Informal):
Cross-Domain Language Modeling: An Empirical Investigation (Nguyen et al., ALTA 2021)
PDF:
https://aclanthology.org/2021.alta-1.22.pdf
Data
BLUE, GLUE, QNLI