Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

Large pre-trained models such as BERT are known to improve different downstream NLP tasks, even when such a model is trained on a generic domain. Moreover, recent studies have shown that when large domain-specific corpora are available, continued pre-training on domain-specific data can further improve the performance of in-domain tasks. However, this practice requires significant domain-specific data and computational resources, which may not always be available. In this paper, we aim to adapt a generic pre-trained model with a relatively small amount of domain-specific data. We demonstrate that by explicitly incorporating multi-granularity information of unseen and domain-specific words via the adaptation of (word-based) n-grams, the performance of a generic pre-trained model can be greatly improved. Specifically, we introduce a Transformer-based Domain-aware N-gram Adaptor, T-DNA, to effectively learn and incorporate the semantic representation of different combinations of words in the new domain. Experimental results illustrate the effectiveness of T-DNA on eight low-resource downstream tasks from four domains. We show that T-DNA is able to achieve significant improvements compared to existing methods on most tasks using limited data with lower computational costs. Moreover, further analyses demonstrate the importance and effectiveness of both unseen words and the information of different granularities. Our code is available at https://github.com/shizhediao/T-DNA.


Introduction
Pre-trained language models have achieved great success and shown promise in various application scenarios across natural language understanding (Devlin et al., 2019; Liu et al., 2019; Tian et al., 2020a) and generation (Lewis et al., 2020; Yang et al., 2020). Applying pre-trained language models to different applications normally follows a two-stage paradigm: pre-training on a large unlabeled corpus and then fine-tuning on a downstream task dataset. However, when there are domain gaps between the pre-training and fine-tuning data, previous studies (Beltagy et al., 2019; Lee et al., 2020) have observed a performance drop caused by the inability to generalize to new domains. To fill these gaps, the main research stream (Beltagy et al., 2019; Alsentzer et al., 2019; Huang et al., 2019; Lee et al., 2020) on adapting pre-trained language models starts from a generic model (e.g., BERT, RoBERTa) and then continues pre-training with similar objectives on a large-scale domain-specific corpus. However, without a sufficient understanding of what causes the performance drop during the domain shift, such adaptation is prone to failure. Moreover, several aspects of continued pre-training leave room for improvement. First, although generic pre-trained models offer a better initialization for continued pre-training, the process still costs considerable time (and money) that is beyond the reach of many institutions. Second, it is clumsy to pre-train domain-specific models repeatedly for each domain on large-scale corpora. Therefore, an efficient and flexible method that can adapt pre-trained language models to different domains with limited resources is desirable.
Starting from the vocabulary mismatch problem observed by Gururangan et al. (2020), we further show empirically that the domain gap is largely caused by domain-specific n-grams. Motivated by this finding, we propose a light-weight Transformer-based Domain-aware N-gram Adaptor (T-DNA) that incorporates n-gram representations to bridge the gap between the source and target vocabulary. Specifically, the proposed model is able to explicitly learn and incorporate better representations of domain-specific words and phrases (in the form of n-grams) via the adaptor network while requiring only a small amount of data. With this adaptor, once entering a new domain, one can choose to train the adaptor alone or train it jointly with a Transformer-based backbone (e.g., BERT), where the joint training paradigm provides larger improvements. In addition, although it is designed for a low-resource setting, the adaptor still works well when sufficient data is available, which ensures its generalization ability in different scenarios.
Experimental results demonstrate that T-DNA significantly improves domain adaptation performance based on a generic pre-trained model and outperforms all baselines on eight classification tasks (on eight datasets). The results confirm that incorporating domain-specific n-grams with the proposed T-DNA is an effective and efficient solution to domain adaptation, showing that the information carried by larger text granularities is highly important for language processing across domains. Moreover, further analyses investigate the factors that may influence the performance of our model, such as the amount of available data, training time and efficiency, and the granularity of domain-specific information, revealing the best settings for using the model.

The Motivation
As observed by Gururangan et al. (2020), the transfer gain of domain-specific pre-training becomes increasingly significant when the source and target domains are vastly dissimilar in terms of vocabulary overlap. Motivated by this association between transfer gain and vocabulary distribution, we further investigate the shift of words and phrases across domains and attempt to alleviate the degradation of language models without large domain-specific corpora.
In particular, we start with a RoBERTa-base model from the generic domain and then fine-tune it on the IMDB (Maas et al., 2011) dataset. We investigate the outputs predicted from the [CLS] embedding on the IMDB development set and divide them into two categories: correct predictions (true positives/negatives) and false predictions (false positives/negatives). To examine the vocabulary mismatch problem during the domain shift, we extract the top 1K most frequent n-grams (with n up to 5) from these two categories, respectively. We identify the n-grams not among the top 10K most frequent n-grams of the source data (a subset sampled from English Wikipedia) as domain-specific n-grams. As revealed in Figure 1, a larger proportion of domain-specific n-grams is captured when the model is misled into making wrong predictions, which suggests that the shifts in semantic meaning of both words and phrases might account for the domain shift. Furthermore, we conjecture that the representations of domain-specific n-grams are unreliable, which exacerbates the model degradation. While more details are presented in §6.3, we briefly mention here that tokens often improperly attend to other tokens in the sentence while omitting the most important words and phrases. In light of this empirical evidence, we are motivated to design a framework to not only capture the domain-specific n-grams but also reliably embed them to extrapolate in the novel domain.
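For concreteness, the following short Python sketch mirrors this analysis under simplified assumptions: sentences are whitespace-tokenized, target_sents holds the text of one prediction category and source_sents holds the generic-domain sample; the thresholds (top 1K vs. top 10K, n up to 5) follow the setup above, while the function names and tokenization are ours.

from collections import Counter

def ngrams(tokens, max_n=5):
    """Enumerate all n-grams up to length max_n (n = 5 as in this analysis)."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def domain_specific_ratio(target_sents, source_sents,
                          top_k_target=1000, top_k_source=10000):
    """Fraction of the target category's top n-grams that do not appear
    among the most frequent n-grams of the source (generic) corpus."""
    tgt = Counter(g for s in target_sents for g in ngrams(s.split()))
    src = Counter(g for s in source_sents for g in ngrams(s.split()))
    top_tgt = [g for g, _ in tgt.most_common(top_k_target)]
    top_src = {g for g, _ in src.most_common(top_k_source)}
    return sum(g not in top_src for g in top_tgt) / len(top_tgt)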

The T-DNA
Our approach follows the standard recipe of pre-training and fine-tuning a language model: it receives a sentence X = t_1 t_2 ⋯ t_i ⋯ t_T, with t_i indicating the i-th token, and outputs the representation of each token. The overall architecture of our approach is shown in Figure 2. In the middle, a generic pre-trained encoder, such as BERT or RoBERTa, provides a representation at the subword level without any target domain knowledge. The right-hand side shows the proposed T-DNA, which enhances the backbone pre-trained encoder: word-based n-grams in X are extracted according to a pre-constructed lexicon L and are represented through the n-gram attention module. The left-hand side shows the n-gram matching matrix and the process of integrating the domain-specific representation with the generic encoding.
In this section, we start with a detailed description of lexicon construction and n-gram extraction, then introduce our n-gram encoding module and how to integrate the n-gram encoding with the backbone model to obtain the domain-aware representation, and end with an illustration of two training strategies.

Lexicon Construction and N-gram Extraction
To better represent and incorporate unseen and domain-specific n-grams, we first need to find and extract them. Here we propose to use an unsupervised method, pointwise mutual information (PMI), to find domain-specific words and phrases through collocations and associations between words. Given a sentence X = x_1 x_2 ⋯ x_K with K words, for any two adjacent words x̃ and x within the sentence, their PMI is calculated by

PMI(x̃, x) = log ( p(x̃ x) / ( p(x̃) p(x) ) ),

where p(·) is the probability of the corresponding n-gram. A high PMI score between the adjacent x̃ and x suggests that they form a good collocation pair, because they have a high probability of co-occurrence and are thus more likely to form an n-gram. On the contrary, a delimiter is inserted between the two adjacent words if their PMI(x̃, x) is less than a threshold σ, i.e., X = x_1 x_2 ⋯ x̃ / x ⋯ x_K. As a result, consecutive words without a delimiter between them are identified as candidate domain-specific n-grams. After using PMI to segment each sentence in the training set of a target task, we select among the candidate n-grams to obtain the final n-gram lexicon L, keeping each n-gram that appears with a frequency of at least f. Given this lexicon, for each training input sentence X = t_1 t_2 ⋯ t_i ⋯ t_T with T tokens, where t_i denotes the i-th token of X, we extract those sub-strings of X that exist in the lexicon to form the domain-specific n-gram sequence S = s_1 s_2 ⋯ s_j ⋯ s_N, with s_j indicating the j-th n-gram of X. At the same time, an n-gram matching matrix M ∈ R^{T×N} is built to record the positions of the extracted domain-specific n-grams and their associated tokens, where m_ij = 1 if t_i ∈ s_j and m_ij = 0 otherwise. The matching matrix is shown on the left-hand side of Figure 2.
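The lexicon construction described above can be sketched as follows. This is a minimal illustration assuming whitespace-tokenized sentences; the names sigma (the PMI threshold σ), min_freq (the frequency cutoff f), and max_n (the maximum n-gram length) are ours.

from collections import Counter
import math

def build_lexicon(corpus_sentences, sigma=1.0, min_freq=2, max_n=3):
    """PMI-based segmentation of each sentence, then frequency filtering."""
    unigram_counts, bigram_counts, total = Counter(), Counter(), 0
    for sent in corpus_sentences:
        words = sent.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
        total += len(words)

    def pmi(w1, w2):
        p_xy = bigram_counts[(w1, w2)] / max(total - 1, 1)
        p_x = unigram_counts[w1] / total
        p_y = unigram_counts[w2] / total
        return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

    candidate_counts = Counter()
    for sent in corpus_sentences:
        words = sent.split()
        segment = [words[0]] if words else []
        for w_prev, w in zip(words, words[1:]):
            if pmi(w_prev, w) >= sigma:        # keep the two words in one segment
                segment.append(w)
            else:                              # insert a delimiter, flush the segment
                if 1 <= len(segment) <= max_n:
                    candidate_counts[tuple(segment)] += 1
                segment = [w]
        if 1 <= len(segment) <= max_n:
            candidate_counts[tuple(segment)] += 1

    # keep candidates that appear at least min_freq times
    return {ng for ng, c in candidate_counts.items() if c >= min_freq}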

Domain-aware Representation
The backbone pre-trained encoder is a Transformer architecture (Vaswani et al., 2017) with L layers, S self-attention heads, and H hidden dimensions, initialized from any pre-trained encoder (e.g., BERT or RoBERTa). The input sentence is passed through it, resulting in a generic hidden state h_i for each input token t_i. To obtain the domain-aware hidden representation, the n-gram adaptor network is implemented as a Transformer encoder with l layers, S self-attention heads, and H hidden dimensions. First, the embeddings of domain-specific n-grams are obtained from an n-gram embedding layer; they are then fed into the n-gram encoder to produce a sequence of hidden states g via a multi-head attention mechanism. The n-gram encoder is able to model the interactions among all extracted n-grams and dynamically weighs them to emphasize truly useful n-grams and ignore noisy information. The combination of the generic representation and the domain-specific n-gram representation is computed by

ĥ_i = h_i + Σ_k g_{i,k},

where ĥ_i is the desired domain-aware representation, and g_{i,k} is the resulting hidden state for the i-th token and the k-th n-gram associated with this token according to the matching matrix M. The n-gram encoding process and hidden state integration are repeated layer-by-layer along with the generic encoder for l layers from the bottom.
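A minimal PyTorch sketch of one adaptor layer is shown below. It assumes the fusion adds, to each token state, the sum of the hidden states of its associated n-grams as selected by the matching matrix M; the class and argument names are illustrative, and the sizes follow RoBERTa-base.

import torch
import torch.nn as nn

class NgramAdaptorLayer(nn.Module):
    """One T-DNA-style adaptor layer (illustrative sketch)."""
    def __init__(self, hidden_size=768, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads,
                                           batch_first=True)
        self.ngram_encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, token_hidden, ngram_emb, matching_matrix):
        # token_hidden:    (B, T, H) generic hidden states from the backbone
        # ngram_emb:       (B, N, H) embeddings of the extracted n-grams
        # matching_matrix: (B, T, N) float, m[t, j] = 1 iff token t belongs to n-gram j
        g = self.ngram_encoder(ngram_emb)                      # contextualized n-gram states
        fused = token_hidden + torch.bmm(matching_matrix, g)   # add associated n-gram states
        return fused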

Training Strategies
Several training strategies could be used, and we adopt two in our experiments: fine-tuning (FT) and task-adaptive pre-training (TAPT). For fine-tuning, we operate on the hidden state of the special classification token [CLS]. Following common practice, we simply add a fully-connected layer as a classifier on top of the model and obtain the probabilities via a softmax layer. The classifier and the whole model are fine-tuned on the labeled task data in the target domain with a cross-entropy loss.
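The fine-tuning head is the usual linear classifier over the [CLS] state; a short sketch (with illustrative sizes) is:

import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Linear classifier on top of the [CLS] hidden state."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_hidden, labels=None):
        logits = self.classifier(self.dropout(cls_hidden))   # (B, num_labels)
        loss = F.cross_entropy(logits, labels) if labels is not None else None
        return logits, loss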
To inject unsupervised target-domain knowledge, we leverage the task-adaptive pre-training proposed by Gururangan et al. (2020), which strips the labels from the downstream task training data and trains the model on this unlabeled data. We use the masked language model (MLM) objective and do not include the next sentence prediction (NSP) task, following Liu et al. (2019) and Lan et al. (2020).
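As a rough sketch of TAPT (backbone only, without the adaptor), one could run MLM-only pre-training on the stripped task text with the Transformers library; the file path and learning rate below are illustrative placeholders, while the batch size and epoch count follow the settings reported later.

from datasets import load_dataset
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Unlabeled task text, one example per line (path is a placeholder).
raw = load_dataset("text", data_files={"train": "task_unlabeled.txt"})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                            mlm_probability=0.15)   # MLM only, no NSP
args = TrainingArguments(output_dir="tapt-roberta",
                         per_device_train_batch_size=16,   # batch size 16 as above
                         num_train_epochs=10,              # 10-15 epochs as above
                         learning_rate=1e-4)               # illustrative value
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()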
Note that our model also supports other training strategies such as domain-adaptive pre-training, which has proven effective in Gururangan et al. (2020). One can pre-train our model on a far larger domain corpus (normally beyond 10GB) first, and then perform task-adaptive pre-training and fine-tuning. Because our main goal is to adapt our model in a low-resource setting in terms of data size and time cost, we leave this for future research.

Experiment Settings
In this section, we first introduce eight benchmarking datasets. Then the baseline models, evaluation metrics, and implementation details are presented in the following three subsections, respectively.

Datasets
Following Gururangan et al. (2020), we conduct our experiments on eight classification tasks from four domains: biomedical sciences (ChemProt, RCT), computer science (CitationIntent, SciERC), news (HyperPartisan, AGNews), and reviews (Amazon, IMDB). For example, IMDB (Maas et al., 2011) contains 50,000 balanced positive and negative reviews from the Internet Movie Database for sentiment classification.
To create a low-resource setting, we constrain the size of all datasets to the thousand-example level. To do so, we randomly select subsets of RCT, AGNews, Amazon, and IMDB with ratios of 1%, 1%, 1%, and 10%, respectively. The details can be found in Table 1.
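A simple way to draw such subsets is random sampling by ratio; this is a sketch, and the seed is an arbitrary choice of ours.

import random

def subsample(examples, ratio, seed=42):
    """Randomly draw a fixed fraction of a dataset, e.g. ratio=0.01 for RCT."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * ratio))
    return rng.sample(examples, k)

# e.g. low_resource_imdb = subsample(imdb_train, 0.10)  # 10% of the IMDB training split (assumed variable)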

Baselines
In our experiments, the following two models serve as the main baselines.
• ROBERTA+FT: the off-the-shelf RoBERTa-base model fine-tuned for downstream tasks.
• ROBERTA+TAPT: task-adaptive pre-training on unlabeled task data starting from RoBERTa, followed by fine-tuning on labeled data.

Evaluation Metrics
Following Beltagy et al. (2019), we adopt macro-F1 for CitationIntent, SciERC, HyperPartisan, AGNews, Amazon, IMDB, and micro-F1 for ChemProt and RCT as evaluation metrics. Macro-F1 will compute the F1 metric independently for each class and then take the average, whereas micro-F1 will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-F1 is preferable if there is class imbalance, which is true for ChemProt and RCT.
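For reference, the two metrics correspond to scikit-learn's f1_score with different averaging modes; the labels below are a toy example.

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]   # toy multi-class labels
y_pred = [0, 1, 1, 2, 2, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # per-class F1, then unweighted mean
micro_f1 = f1_score(y_true, y_pred, average="micro")  # counts pooled over all classes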

Implementation
We implement the RoBERTa-base architecture and initialize it with pre-trained weights from Huggingface's Transformers library. To obtain a fast and warm start for n-gram representations, we utilize fastText (Bojanowski et al., 2017) to initialize the n-gram embeddings. Considering the small amount of data and based on our experience, the number of n-gram encoding layers l is set to 1. For unsupervised task-adaptive pre-training (TAPT), the batch size is set to 16 and the number of training epochs ranges from 10 to 15. We adopt Adam (Kingma and Ba, 2015) as the optimizer, where the learning rates for different datasets can be found in our code. The dropout rate is set to 0.5. For task-specific fine-tuning (FT), we use similar hyperparameter settings; the details are elaborated in the Appendix. All experiments are run on Nvidia V100 GPUs.
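One plausible way to warm-start the n-gram embeddings with fastText is sketched below; the model path is a placeholder, the averaging scheme is our assumption rather than the exact procedure, and a linear projection would be needed if the fastText dimension differs from the encoder's hidden size.

import fasttext
import numpy as np

ft = fasttext.load_model("cc.en.300.bin")   # any pre-trained fastText model (placeholder path)

def init_ngram_vector(ngram_words):
    """Average the fastText vectors of an n-gram's words; fastText's subword
    information lets it produce vectors for unseen domain-specific words."""
    return np.mean([ft.get_word_vector(w) for w in ngram_words], axis=0)

# lexicon: the set of word tuples produced by the PMI step above (assumed)
# embedding_matrix = np.stack([init_ngram_vector(ng) for ng in sorted(lexicon)])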

Experimental Results
We compare the performance of the RoBERTa model with and without T-DNA on the aforementioned datasets. In both fine-tuning and task-adaptive pre-training experiments, T-DNA shows significant improvements over the pre-trained generic RoBERTa.

Table 2: The overall performance of T-DNA and the comparison against existing models on eight target downstream datasets. We report average scores across five random seeds, with standard deviations as subscripts.

Fine-Tuning
The results of fine-tuning on the eight datasets are reported in Table 2. In general, the RoBERTa model with T-DNA outperforms the one without T-DNA on all datasets, clearly indicating the effectiveness of T-DNA in emphasizing multi-granularity information. On average, T-DNA brings a performance improvement of around 2.66%. Across all eight datasets, T-DNA achieves the greatest improvement (8.21%) on the CitationIntent dataset and the smallest improvement on the AGNews dataset. One reasonable explanation for the different improvements is that the gap between the RoBERTa pre-training domain and the CS domain is the greatest, so that an effective adaptation strategy can obtain far larger gains there. To confirm this, we follow Gururangan et al. (2020) in characterizing domain similarity by vocabulary overlap and draw the same conclusion: RoBERTa's pre-training domain has a vocabulary similar to News and Reviews, but far more dissimilar to BioMed and CS. In light of this observation, we recognize that the proposed method is most applicable when the domain gap is large. In this scenario, the potential of incorporating multi-grained information via domain-specific n-grams is fully exploited to boost the performance of adaptation.
When comparing the improvements over the four domains, T-DNA offers gains of 1.18%, 6.38%, 2.33%, and 0.75% on BioMed, CS, News, and Reviews, respectively. The improvement on the CS domain is the largest while that on the Reviews domain is the smallest, which is consistent with the previous per-dataset analysis for similar reasons.

Task-Adaptive Pre-Training
In the previous section, we showed that T-DNA is helpful for fine-tuning. Additionally, we would like to explore whether T-DNA is complementary to further training strategies, such as task-adaptive pre-training (TAPT). TAPT, which pre-trains on the unlabeled task dataset drawn from the task distribution, has been shown useful for pre-trained models in previous studies (Howard and Ruder, 2018; Gururangan et al., 2020). The experimental results of the two models with and without T-DNA are reported in the bottom two rows of Table 2. From the results, we can clearly see that the model with T-DNA achieves better performance on all datasets compared to the generic RoBERTa model without T-DNA. T-DNA improves performance by approximately 1.59% on average, which shows that the effectiveness of T-DNA does not vanish when combined with TAPT. Instead, it leads to a further performance boost for pre-trained models, indicating that T-DNA is a complementary approach, where explicitly modeling domain-specific information helps the unsupervised learning of representations (i.e., the masked language model (MLM) pre-training objective).
Overall, for both the FT and TAPT experiments, the results show that T-DNA significantly improves domain adaptation performance based on a generic pre-trained model. We attribute this improvement to the essential domain-specific semantic information carried by n-grams and the reliable representations of n-grams produced by the T-DNA network.

Analyses
We analyze several aspects of T-DNA, including the effects of different granularities and the effects of data size. In addition, we examine the attention mechanism to verify the effects of n-gram representations during the domain shift. The details are illustrated in this section.

Effects of Different Granularities
The lexical unit in RoBERTa is a subword obtained from byte pair encoding (BPE) (Sennrich et al., 2016) tokenization, resulting in a smaller token space and more training data per token. Our approach provides coarse-grained information carried by larger lexical units, i.e., n-grams.
To verify the contribution of larger-granularity information, we compare the improvements brought by T-DNA with information of different granularities, for n from 0 to 3. Note that here n means that we extract and incorporate all n-grams with length smaller than or equal to n. For example, n = 3 means that we include all unigrams, bigrams, and trigrams. Two consistent observations can be made. First, adding only 1-grams brings improvements over 0-grams (i.e., without T-DNA) on all eight datasets, as shown in Figure 3. The tokens in the generic encoder are at the subword level, whereas our unigrams are at the word level and can be seen as combinations of subwords. Therefore, the results suggest that adding unseen words through our adaptor network is effective, as it enhances the interaction between subwords of the same word, especially for new words in the target domain.
Moreover, on top of 1-grams, involving larger granularities offers further gains. Comparing the 2-gram and 3-gram settings against the 1-gram setting, the consistent improvements of T-DNA demonstrate that the potential boundary information presented by n-grams plays an essential role in learning representations by providing explicit and better guidance.
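In practice, this ablation amounts to filtering the lexicon by maximum n-gram length before building the matching matrix; a trivial sketch (assuming the lexicon stores n-grams as word tuples, as in the earlier sketch) is:

def restrict_lexicon(lexicon, max_n):
    """Keep only n-grams of length <= max_n; max_n = 0 disables T-DNA entirely."""
    return {ng for ng in lexicon if len(ng) <= max_n}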

Effects of Data Size
In the previous sections, we explored the virtue of incorporating multi-grained information under resource-limited settings, where only a small subset of each dataset can be accessed. In addition, we are curious whether T-DNA works well at a larger scale. To this end, we sample different ratios (i.e., 10%, 20%, 50%, 100%) of four datasets (i.e., RCT, AGNews, Amazon, and IMDB) and investigate how T-DNA performs at different data scales. As shown in Table 3, the model with T-DNA always outperforms the one without T-DNA on every subset of the four datasets. This demonstrates that models with T-DNA can easily adapt to datasets of any size with the help of domain-specific n-gram information. However, it is also noted that the performance gains of our method decay as the amount of training data increases, dropping from 1.24% (proportion = 10%) to 0.36% (proportion = 100%). This is not surprising, because with adequate data a model is able to learn a good representation through supervised learning without the need for prior knowledge. However, since sufficient data, especially labeled data, is normally not available in practice, we argue that T-DNA is desirable and necessary for domain adaptation.

Visualization of N-gram Representations
To verify the effects of n-gram representations during the domain shift, we examine the attention mechanism of RoBERTa and T-DNA by plotting attention maps and salience maps using the LIT tool (Tenney et al., 2020). In the attention map of RoBERTa without T-DNA, we find that tokens often improperly attend to other tokens in the sentence. For example, in Figure 4, "Barbie" attends more to "animated" and "scary" but omits "creepy" and fails to capture "scary as hell" as an integrated phrase. In contrast, when the model is equipped with T-DNA, it shifts its attention to include "creepy" and focuses on the informative phrase "scary as hell". Furthermore, the salience map of RoBERTa without T-DNA suggests that "animated" and "scary" dominate its prediction, while "creepy" and "scary as hell" are captured by our T-DNA, which is consistent with the decision process of human beings. Due to space limitations, more visualized examples are not shown here. However, based on considerable empirical evidence, we conclude that the unreliable representations of domain-specific n-grams (words and phrases) are likely one of the main causes of model degradation.

Related Work
A large performance drop of pre-trained models caused by domain shift has been observed, and many domain-specific BERT models (Beltagy et al., 2019; Alsentzer et al., 2019; Huang et al., 2019; Lee et al., 2020) have been introduced to bridge the domain gap. For example, SciBERT (Beltagy et al., 2019) is trained on 1.14M scientific papers from the Semantic Scholar corpus (Ammar et al., 2018) for 7 days on a TPU v3-8 machine, and BioBERT (Lee et al., 2020) is trained on PubMed abstracts and PMC full-text articles for 23 days on eight NVIDIA V100 GPUs. ClinicalBERT (Alsentzer et al., 2019) is trained on about 2 million notes in the MIMIC-III v1.4 database (Johnson et al., 2016) for 17-18 days on a single GeForce GTX TITAN X 12 GB GPU. However, they all incur a huge computational cost, which is not affordable for many university labs or institutions. This is precisely why we believe our efficient adaptor is useful to the community. Although Gururangan et al. (2020) introduced task-adaptive pre-training (TAPT) to save time by training on unlabeled downstream task data, we demonstrate that our plug-in adaptor is faster and more effective because of its explicit learning strategy and efficient model architecture.
Out-of-vocabulary (OOV) words, i.e., words that are not in the vocabulary list, have received a lot of attention in recent years. One way to handle OOV words is to simply utilize and learn an "unknown" embedding during training. Another way is to add in-domain words to the original vocabulary list and learn their representations by pre-training from scratch (Beltagy et al., 2019; Gu et al., 2020), which requires substantial resources and training data. Moreover, SciBERT (Beltagy et al., 2019) found that an in-domain vocabulary is helpful but not significantly so; we attribute this to the inefficiency of implicitly learning the in-domain vocabulary. To represent OOV words in multilingual settings, the mixture mapping method (Wang et al., 2019) utilizes a mixture of English subword embeddings, but it has been shown to be ineffective for domain-specific words by Tai et al. (2020). exBERT (Tai et al., 2020) applies an extension module to adapt an augmented embedding for the in-domain vocabulary, but it still requires large-scale continued pre-training. Similar to our work, these studies highlight the importance of domain-specific words, but none of them explores the cause of the performance drop during a domain shift or examines the importance of multi-grained information. Contextual information of larger granularity carried by spans or n-grams has proven helpful for enhancing text representations in Chinese (Song et al., 2009; Song and Xia, 2012; Ouyang et al., 2017; Kim et al., 2018; Peng et al., 2018; Higashiyama et al., 2019; Tian et al., 2020e,b; Diao et al., 2020; Song et al., 2021) and English (Joshi et al., 2020; Xiao et al., 2020; Tian et al., 2020c,d). In addition to text encoders for pre-training, kNN-LM (Khandelwal et al., 2019) augments the language model for effective domain adaptation by varying the nearest-neighbor datastore of similar contexts without further training. However, all of these previous studies focus on either general pre-training procedures or different tasks (e.g., language modeling) and do not explore the effectiveness of multi-grained information for domain adaptation. We hence view them as orthogonal to our work.

Conclusion
In this work, we first reveal a novel finding behind the performance drop during a domain shift: unreliable representations of domain-specific n-grams cause the failure of adaptation. To this end, we propose an adaptor network for generic pre-trained encoders that supports multiple training strategies, such as task-adaptive pre-training and fine-tuning, both leading to significant improvements on eight classification datasets from four domains (biomedical, computer science, news, and reviews). Our method is easy to implement and simple but effective, implying that explicitly representing and incorporating domain-specific n-grams offers large gains. In addition, further analyses consistently demonstrate the importance and effectiveness of both unseen words and the information carried by coarse-grained n-grams.
