Snapshot-Guided Domain Adaptation for ELECTRA

Introduction
Pre-trained language models (Devlin et al., 2019; Clark et al., 2020) have demonstrated significant capabilities in various NLP tasks. While most language models follow the BERT style (Devlin et al., 2019) of predicting the original tokens at masked positions, ELECTRA (Clark et al., 2020) trains a discriminator to predict whether each token in a corrupted input has been replaced by a generator. BERT mainly learns from the masked subset of input tokens, whereas ELECTRA makes predictions over all input tokens, significantly improving sample efficiency and leading to strong results on general tasks (Chi et al., 2022; He et al., 2021; Meng et al., 2021; Xu et al., 2020; Meng et al., 2022; Bajaj et al., 2022; Shen et al., 2021). Domain adaptation of BERT-style models has been shown to consistently improve performance on domain-related tasks (Gururangan et al., 2020; Lee et al., 2020; Yao et al., 2021). However, the adaptation of ELECTRA remains under-explored. This motivates us to investigate domain adaptation of ELECTRA, so as to optimize its performance on domain-related tasks.

[Figure 1: Given a masked input, the current generator and a snapshot of the generator from an earlier step predict different tokens ("hypotension occurred with standing" vs. "changes occurred with standing"); SODA uses this disagreement to identify domain-specific tokens.]
Recent research suggests that domain-specific tokens and texts can benefit pre-trained models in certain domains and tasks (Gururangan et al., 2020; Lee et al., 2020; Gu et al., 2020). Specifically, Gu et al. (2020) proposed a task-specific method to selectively mask domain-specific tokens during pre-training. However, a model pre-trained for one task can be a deterrent for other tasks in the same domain. On the other hand, task-agnostic methods (Gururangan et al., 2020; Lee et al., 2020; Miolo et al., 2021) are more widely applicable: a model trained once on the domain can be used for multiple downstream tasks in that domain.
In this paper, we propose SnapshOt-guided Domain Adaptation (SODA) for ELECTRA, which is also agnostic to downstream tasks. SODA leverages the difference between the generator at the current training step and the generator at an earlier step to imitate the domain shift during adaptation, and uses this difference to dynamically identify domain-specific tokens. During continued pre-training on the domain, the ELECTRA generator from an earlier training step is referred to as the snapshot. As shown in Figure 1, given the masked input, SODA finds the domain-specific token "hypotension" by comparing the generator with the snapshot, and then emphasizes this domain-specific token by re-weighting the discriminator loss. In the implementation, the snapshot is loaded from a saved checkpoint of an earlier step, so no additional training parameters are introduced. Furthermore, SODA employs different snapshots during different training intervals, so as to dynamically select the tokens specific to the domain shift at hand (van der Wees et al., 2017).
We conduct experiments in both the computer science and biomedical domains, where SODA achieves state-of-the-art results on the domain-related tasks. We also evaluate different methods for selecting domain-specific tokens to demonstrate the effectiveness of our method.
In summary, our contributions include:
• To the best of our knowledge, we are the first to explore domain adaptation for ELECTRA.
• We propose a snapshot-guided domain adaptation method to dynamically emphasize domain-specific tokens.
• According to the experimental results in two specific domains, SODA achieves promising performance on four domain-related tasks.
Background: ELECTRA

ELECTRA (Clark et al., 2020) trains a discriminator to predict whether each token in a corrupted input is replaced by a generator. Given an original sequence $x = [x_1, x_2, \ldots, x_n]$, 15% of the tokens are randomly replaced with [MASK] symbols. For each masked position $i$, the generator predicts a distribution $p_G(x \mid h_i)$ and then samples one token $x^R_i \sim p_G(x \mid h_i)$ to replace the original token $x_i$, resulting in a corrupted sequence $x^R$. Here $\{h_i\}_{i=1}^{n}$ are the contextualized representations produced by the Transformer.
Given the corrupted sequence $x^R$, the discriminator $D$ is trained to distinguish each replaced token $x^R_i$ from the original token $x_i$ via the binary classification loss:

$$\mathcal{L}_D = \sum_{i=1}^{n} \Big[ -\mathbb{1}\big(x^R_i = x_i\big) \log p_D\big(x^R_i = x_i \mid h_i\big) - \mathbb{1}\big(x^R_i \neq x_i\big) \log \big(1 - p_D\big(x^R_i = x_i \mid h_i\big)\big) \Big], \quad (1)$$

where $p_D\big(x^R_i = x_i \mid h_i\big) = \mathrm{sigmoid}\big(w^\top h_i\big)$ and $w$ is a learnable weight vector.
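To make Eq. 1 concrete, the following is a minimal PyTorch sketch of the discriminator loss. It is our own illustration rather than the authors' code; the tensor shapes and argument names are assumptions.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(h, w, x_original, x_corrupted):
    """Binary classification loss of Eq. 1 over all positions.

    h:           (batch, seq_len, dim) contextualized representations h_i
    w:           (dim,) learnable weight vector
    x_original:  (batch, seq_len) original token ids x_i
    x_corrupted: (batch, seq_len) token ids x^R_i after generator replacement
    """
    logits = h @ w                                  # w^T h_i for every position
    # Target is 1 where the token was kept (x^R_i == x_i) and 0 where it was replaced.
    labels = (x_corrupted == x_original).float()
    return F.binary_cross_entropy_with_logits(logits, labels, reduction="sum")
```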

Method
Selecting Domain-specific Tokens. Grangier and Iter (2022) and Moore and Lewis (2010) revealed that domain-specific data can be selected according to the prediction differences between an in-domain model and an out-of-domain model. SODA selects domain-specific tokens with the help of a snapshot, where the snapshot is the generator from an earlier training step. We assume that the snapshot, having been trained on the domain for fewer steps, is more "out-of-domain" than the current generator. Based on this assumption, we can select domain-specific tokens by comparing the predictions of the current generator with those of the snapshot.
Specifically, for each masked position $i$ in the input, the generator $G$ and the snapshot $S$ each predict a distribution, $p_G(x \mid h_i)$ and $p_S(x \mid h_i)$ respectively. We then make a binary decision $b_{G,S}(x_i)$ of whether token $x_i$ is domain-specific:

$$b_{G,S}(x_i) = \mathbb{1}\big(x^G_i \neq x^S_i\big), \quad (2)$$

where $x^G_i$ and $x^S_i$ are sampled from the vocabulary $V$ by:

$$x^G_i \sim p_G(x \mid h_i), \qquad x^S_i \sim p_S(x \mid h_i). \quad (3)$$

Emphasizing Domain-specific Tokens. Given the domain-specific tokens, we propose to emphasize them by assigning them a higher weight in the discriminator loss. Let $\omega_i$ be the loss weight assigned to token $x_i$; we set $\omega_i = 1 + \beta$ if $x_i$ is domain-specific and $\omega_i = 1$ otherwise, where $\beta$ is the augmented loss weight. Based on Eq. 2, the loss weight of $x_i$ is formulated as:

$$\omega_i = 1 + \beta \cdot b_{G,S}(x_i). \quad (4)$$

The re-weighted loss function based on Eq. 1 is:

$$\mathcal{L}^{\omega}_D = \sum_{i=1}^{n} \omega_i \Big[ -\mathbb{1}\big(x^R_i = x_i\big) \log p_D\big(x^R_i = x_i \mid h_i\big) - \mathbb{1}\big(x^R_i \neq x_i\big) \log \big(1 - p_D\big(x^R_i = x_i \mid h_i\big)\big) \Big]. \quad (5)$$

Training Framework. SODA continues to pre-train ELECTRA on the domain corpus using different snapshots for each training interval. Assume the minimum gap between the snapshot and the current generator is $W$, and the snapshot interval is $T$. Our strategy is to use the snapshot taken at step $nT$ to assist token selection during the interval from $W + nT$ to $W + (n+1)T$.
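The following is a minimal sketch of how Eqs. 2-5 could be realized, under our reading of Eq. 2 as a disagreement check between the generator and the snapshot; all names and shapes are illustrative and do not come from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def soda_discriminator_loss(h, w, x_original, x_corrupted,
                            gen_logits, snap_logits, masked_pos, beta):
    """
    h:           (B, L, D) discriminator representations
    w:           (D,) weight vector of the binary classifier
    x_original:  (B, L) original token ids
    x_corrupted: (B, L) token ids after generator replacement
    gen_logits:  (B, L, V) generator logits at the current step
    snap_logits: (B, L, V) logits of the snapshot (an earlier checkpoint)
    masked_pos:  (B, L) bool, True at masked positions
    beta:        augmented loss weight for domain-specific tokens
    """
    # Sample one token from each model's output distribution (Eq. 3).
    x_g = torch.distributions.Categorical(logits=gen_logits).sample()   # (B, L)
    x_s = torch.distributions.Categorical(logits=snap_logits).sample()  # (B, L)
    # Treat a masked token as domain-specific when the two predictions disagree (Eq. 2).
    domain_specific = (x_g != x_s) & masked_pos
    # Per-token loss weights (Eq. 4): 1 + beta for domain-specific tokens, 1 otherwise.
    omega = 1.0 + beta * domain_specific.float()
    # Per-token ELECTRA loss (Eq. 1), then re-weighted and summed (Eq. 5).
    labels = (x_corrupted == x_original).float()
    per_token = F.binary_cross_entropy_with_logits(h @ w, labels, reduction="none")
    return (omega * per_token).sum()
```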
For example, at training step $W + nT$, the generator saved at step $nT$ is loaded as the snapshot to assist token selection. As training progresses, we expect the selection to prefer tokens that are more specific to the current domain shift (van der Wees et al., 2017). Therefore, at the beginning of the next interval (step $W + (n+1)T$), we replace the snapshot with the generator at step $(n+1)T$, which is closer to the current generator.
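A small illustrative helper (not from the paper's code) that maps a training step to the checkpoint step loaded as the snapshot under this schedule:

```python
from typing import Optional

def snapshot_step(step: int, W: int, T: int) -> Optional[int]:
    """Checkpoint step whose generator is loaded as the snapshot at `step`.

    Returns None before the minimum gap W has elapsed (no snapshot is used yet).
    """
    if step < W:
        return None
    n = (step - W) // T      # interval index: step lies in [W + nT, W + (n+1)T)
    return n * T

# With the biomedical settings (W=30K, T=20K), step 75K falls in the interval
# [70K, 90K) and therefore uses the snapshot saved at step 40K.
assert snapshot_step(75_000, 30_000, 20_000) == 40_000
```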

Datasets
We use the same pre-training corpora as AdaLM (Yao et al., 2021): the computer science corpus is collected from arXiv, and the biomedical corpus is the latest collection from PubMed. For the downstream tasks, we use ACL-ARC (Jurgens et al., 2018) and SCIERC (Luan et al., 2018) for the computer science domain, and ChemProt (Kringelum et al., 2016) and RCT (Dernoncourt and Lee, 2017) for the biomedical domain. Specifications of these datasets are given in Appendix A.

Implementation
We use ELECTRA-Base (Clark et al., 2020) as our source model. Our pre-training code is built upon Fairseq. Detailed experimental settings for the continued pre-training are listed in Appendix B. For the snapshot settings, the minimum gap $W$ is 30K steps. We set the interval $T$ to 35K steps for the computer science domain and 20K steps for the biomedical domain. An analysis of the snapshot interval is given in Section 4.4. In the re-weighted loss function, the augmented loss weight $\beta$ is 0.2 for the computer science domain and 0.5 for the biomedical domain. We recommend using $\beta$ values below 1, because an overly large $\beta$ hurts the learning of the other tokens.
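For quick reference, the SODA-specific hyperparameters above can be summarized as follows; the key names are ours and are not Fairseq options.

```python
# SODA-specific hyperparameters reported above (key names are illustrative).
soda_hparams = {
    "computer_science": {"min_gap_W": 30_000, "interval_T": 35_000, "beta": 0.2},
    "biomedical":       {"min_gap_W": 30_000, "interval_T": 20_000, "beta": 0.5},
}
```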
Our fine-tuning code is based on AdaLM. We run a hyperparameter search to find the best-performing models. The search settings and results are listed in Appendix B.

Main results
We present the downstream task results of different competitive methods in Table 1. For each source model, the first row is the general model without continued pre-training, and DAPT (Gururangan et al., 2020) is vanilla continued pre-training on the domain corpus.
SODA achieves the best performance across the tasks in both domains when ELECTRA is the source model, demonstrating the effectiveness of emphasizing domain-specific tokens during continued pre-training. ELECTRA with vanilla continued pre-training outperforms the BERT- and RoBERTa-based methods, which suggests the great potential of ELECTRA as a source model for domain adaptation. We also observe that whether the generator is randomly initialized has an insignificant impact on the continued pre-training of ELECTRA: DAPT with Random G performs better than DAPT in the biomedical domain but worse in the computer science domain.

Ablation analysis
Token Selection Method. We compare our snapshot-guided token selection method with three alternatives: (1) Rand: randomly select 10% of the input tokens; (2) Know: select the tokens that are wrongly predicted by the generator, because such tokens contain more knowledge (Wang et al., 2022); (3) Freq: select the tokens with large frequency differences between the target- and source-domain corpora, where we use the Wikipedia corpus (Zhu et al., 2015) to represent the source domain. As shown in Table 2, SODA consistently outperforms all the alternatives, as illustrated by the sketch below. This suggests that dynamics is a crucial factor for token selection in domain adaptation.
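A rough sketch of the Freq baseline follows; the tokenization, normalization, and cut-off are our illustrative choices rather than the paper's exact recipe.

```python
from collections import Counter

def frequent_domain_tokens(target_tokens, source_tokens, top_k=10_000):
    """Tokens whose relative frequency in the target-domain corpus most exceeds
    their relative frequency in the source (Wikipedia) corpus."""
    tgt, src = Counter(target_tokens), Counter(source_tokens)
    n_tgt, n_src = sum(tgt.values()), sum(src.values())
    diff = {t: tgt[t] / n_tgt - src.get(t, 0) / n_src for t in tgt}
    return set(sorted(diff, key=diff.get, reverse=True)[:top_k])
```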
Snapshot Interval Length. We test snapshot interval lengths $T$ of 70K, 35K, and 20K steps to analyze their effects. As shown in Figure 2, compared to 70K, a relatively shorter interval (i.e., 35K for the computer science domain and 20K for the biomedical domain) improves performance, because a shorter interval allows the snapshot to change more frequently, so as to dynamically select the tokens that are more specific to the domain shift at hand. We also record the ratio of selected tokens among all masked tokens in each interval. Figure 3 shows the records of the best-performing models in the computer science and biomedical domains, whose interval lengths are 35K and 20K respectively. The snapshot is loaded for the first time at step 30K, at which point the ratio jumps from 0 to a high value.
After that, the ratio drops every $T$ steps until the end of training (95K steps). This is intuitive: as the model converges, there are fewer domain-specific tokens left to learn.
Domain Specificity. In this paper, we use domain specificity to summarize the prediction differences caused by different numbers of pre-training steps on the domain. We also analyze domain specificity from the perspective of token frequency: first, build a domain-specific token set from the tokens with large frequency differences between the target and source (Wikipedia (Zhu et al., 2015)) domains; second, measure specificity as the ratio of tokens belonging to this domain-specific token set. The specificity of the tokens predicted by the snapshot and the generator, and the specificity of the tokens selected by SODA, are shown in Table 3.
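The specificity measure can be sketched as a simple ratio; this is an illustration of the description above, not the authors' exact script.

```python
def specificity(tokens, domain_specific_set):
    """Fraction of `tokens` that belong to the frequency-based domain-specific set."""
    if not tokens:
        return 0.0
    return sum(t in domain_specific_set for t in tokens) / len(tokens)

# e.g. specificity(tokens_predicted_by_generator, domain_specific_set) for each row of Table 3.
```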
From the results, the specificity of the generator is higher than that of the snapshot, and most tokens selected by SODA belong to the domain-specific token set. We also filter out the selected tokens that are not in the domain-specific token set to check their impact. Filtering out such tokens leads to performance degradation, with an average score drop of 0.50 in the computer science domain and 0.14 in the biomedical domain, which shows that SODA dynamically selects tokens that benefit continued pre-training, even if some of them are not specific in terms of token frequency.

Case study
We conduct a case study to analyze the tokens selected at different training steps. Table 4 shows the results. We observe that the snapshot helps find domain-specific tokens such as "paper" and "sorting" in the computer science domain and "inhibit" and "chemical" in the biomedical domain. Compared with the tokens selected at step 50K, the tokens selected at step 85K are fewer and more domain-specific, which shows that SODA dynamically selects the domain-specific tokens as training progresses.

Conclusion
In this paper, we design a snapshot-guided domain adaptation method for ELECTRA that captures token-level domain knowledge by comparing generators from different training steps. Our method dynamically selects and emphasizes domain-specific tokens, which benefits domain adaptation. Experimental results show that our method achieves state-of-the-art results without introducing additional training parameters.

Limitations
We only conduct experiments on ELECTRA; future research may experiment on BERT-style models, e.g., by replacing the discriminator with a BERT-style model to effectively adapt the model. Besides, we set the loss weights for the domain-specific tokens to static values; future research may explore dynamic loss weights to improve performance.

Figure 2: Experiments with different snapshot intervals. The Y-axis represents the average score of the tasks in the domain.

Figure 3: Ratios of the selected tokens among all masked tokens in each interval for the best-performing models.

Table 2: Results of different token selection methods. We report averages across five random seeds, with standard deviations as subscripts.

Table 3: Specificity of the tokens predicted by the snapshot and the generator, and of the tokens selected by SODA.

Table 4: The tokens selected at training steps 50K/85K (shown in red) in the computer science and biomedical domains. Input is the original text, where tokens in boldface are masked. S and G stand for the snapshot and the generator respectively, with the trained steps as subscripts.