VarMAE: Pre-training of Variational Masked Autoencoder for Domain-adaptive Language Understanding

Pre-trained language models have achieved promising performance on general benchmarks, but underperform when migrated to a specific domain. Recent works perform pre-training from scratch or continual pre-training on domain corpora. However, in many specific domains, the limited corpus can hardly support obtaining precise representations. To address this issue, we propose a novel Transformer-based language model named VarMAE for domain-adaptive language understanding. Under the masked autoencoding objective, we design a context uncertainty learning module to encode the token's context into a smooth latent distribution. The module can produce diverse and well-formed contextual representations. Experiments on science- and finance-domain NLU tasks demonstrate that VarMAE can be efficiently adapted to new domains with limited resources.


Introduction
Pre-trained language models (PLMs) have achieved promising performance in natural language understanding (NLU) tasks on standard benchmark datasets (Wang et al., 2018; Xu et al., 2020). Most works (Devlin et al., 2019; Liu et al., 2019) leverage the Transformer-based pre-train/fine-tune paradigm to learn contextual embeddings from large unsupervised corpora. Masked autoencoding, also called masked language modeling in BERT (Devlin et al., 2019), is a widely used pre-training objective that randomly masks tokens in a sequence and then recovers them. This objective yields a deep bidirectional representation of all tokens in a BERT-like architecture. However, models pre-trained on standard corpora (e.g., Wikipedia) tend to underperform when migrated to a specific domain due to the distribution shift (Lee et al., 2020).
Recent works perform pre-training from scratch (Gu et al., 2022; Yao et al., 2022) or continual pre-training (Gururangan et al., 2020; Wu et al., 2022) on large domain-specific corpora. But in many specific domains (e.g., finance), effective and intact unsupervised data is difficult and costly to collect due to data accessibility, privacy, security, etc. The limited domain corpus may not support pre-training from scratch (Zhang et al., 2020), and also greatly limits the effect of continual pre-training due to the distribution shift. Besides, some practitioners (e.g., non-industry academics or professionals) have limited access to computing power for training on a massive corpus. Therefore, how to obtain effective contextualized representations from a limited domain corpus remains a crucial challenge.
Relying on the distributional similarity hypothesis in linguistics (Mikolov et al., 2013a), i.e., that similar words have similar contexts, masked autoencoders (MAEs) leverage the co-occurrence between words and their contexts to learn word representations. However, when pre-training on a limited corpus, most word representations can only be learned from few co-occurring contexts, leading to sparse word embeddings in the semantic space. Besides, when reconstructing masked tokens, it is difficult to perform an accurate point estimation (Li et al., 2020) based on the partially visible context of each word; that is, the possible contexts of each token should be diverse. Limited data provides only restricted context information, which causes MAEs to learn relatively poor context representations in a specific domain.
To address the above issue, we propose a novel Variational Masked Autoencoder (VarMAE), a regularized version of MAEs, for better domain-adaptive language understanding. On top of the vanilla MAE, we design a context uncertainty learning (CUL) module for learning precise context representations when pre-training on a limited corpus. Specifically, the CUL encodes each token's point-estimate context in the semantic space into a smooth latent distribution. The module then reconstructs the context using feature regularization specified by prior distributions over the latent variables. In this way, latent representations of similar contexts can be close to each other and vice versa (Li et al., 2019). Accordingly, we obtain a smoother space and more structured latent patterns. We conduct continual pre-training on unsupervised corpora in two domains (science and finance) and then fine-tune on the corresponding downstream NLU tasks. The results consistently show that VarMAE outperforms representative language models, including vanilla pre-trained (Liu et al., 2019) and continual pre-training methods (Gururangan et al., 2020), when adapting to new domains with limited resources. Moreover, compared with the masked autoencoding objective in MAEs, the objective of VarMAE produces more diverse and well-formed context representations.

VarMAE
In this section, we develop a novel Variational Masked Autoencoder (VarMAE) to improve the vanilla MAE for domain-adaptive language understanding. The overall architecture is shown in Figure 1. Based on the vanilla MAE, we design a context uncertainty learning (CUL) module for learning precise context representations when pre-training on a limited corpus.

Architecture of Vanilla MAE
Masking. We randomly mask some percentage of the input tokens and then predict those masked tokens. Given an input token sequence $X = \{x_1, \dots, x_n\}$, where $n$ is the sentence length, the model selects a random set of positions (integers between 1 and $n$) to mask out, $M = \{m_1, \dots, m_k\}$, where $k = \lceil 0.15n \rceil$, i.e., 15% of the tokens are masked out. The tokens at the selected positions are replaced with a [MASK] token. The masked sequence is denoted as $X^{\text{masked}} = \text{REPLACE}(X, M, \texttt{[MASK]})$.
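As a rough illustration of this masking step (not the authors' code), the sketch below assumes token ids are plain Python integers and that `mask_id` is the id of the [MASK] token; the 80/10/10 replacement refinement used in practice (mentioned in the experiments) is omitted here.

```python
import math
import random

def mask_tokens(token_ids, mask_id, mask_ratio=0.15):
    """Replace a random 15% of positions with the [MASK] id.

    token_ids: list of int token ids for one sequence X = {x_1, ..., x_n}.
    Returns the masked sequence and the (0-indexed) masked positions M.
    """
    n = len(token_ids)
    k = math.ceil(mask_ratio * n)                   # k = ceil(0.15 * n)
    masked_positions = sorted(random.sample(range(n), k))
    masked = list(token_ids)
    for pos in masked_positions:
        masked[pos] = mask_id                       # REPLACE(X, M, [MASK])
    return masked, masked_positions
```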
Transformer Encoder. The vanilla MAE usually adopts a multi-layer bidirectional Transformer (Vaswani et al., 2017) as its basic encoder, like previous pre-training models (Liu et al., 2019). The Transformer captures the contextual information of each token in the sentence via the self-attention mechanism and generates a sequence of contextual embeddings. Given the masked sentence $X^{\text{masked}}$, the context representation is denoted as $C = \{c_1, \dots, c_n\}$.

Language Model Head
We adopt a language model (LM) head to predict the original token from the reconstructed representation. The number of output channels of the LM head equals the vocabulary size. Based on the context representation $c_i$, the distribution of the masked prediction is estimated by $p_\theta(x_i|c_i) = \mathrm{softmax}(W c_i + b)$, where $W$ and $b$ denote the weight matrix and bias of one fully-connected layer, and $\theta$ refers to the trainable parameters. The predicted token is obtained by $x'_i = \arg\max p_\theta(x_i|c_i)$, where $x'_i$ denotes the predicted original token.
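A minimal PyTorch sketch of such an LM head; the class and argument names are our own, and the hidden and vocabulary sizes are whatever the backbone provides.

```python
import torch
import torch.nn as nn

class LMHead(nn.Module):
    """Predicts the original token from a (reconstructed) context vector c_i."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)   # W c_i + b

    def forward(self, c):                        # c: (batch, seq_len, hidden_size)
        logits = self.decoder(c)                 # (batch, seq_len, vocab_size)
        probs = torch.softmax(logits, dim=-1)    # p_theta(x_i | c_i)
        return probs.argmax(dim=-1)              # predicted token ids x'_i
```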

Context Uncertainty Learning
Due to the flexibility of natural language, one word may have different meanings in different domains. In many specific domains, the limited corpus can hardly support obtaining precise representations. To address this, we introduce a context uncertainty learning (CUL) module to learn regularized context representations for all tokens: masked tokens with more noise and unmasked tokens with less noise. Inspired by variational autoencoders (VAEs) (Kingma and Welling, 2014; Higgins et al., 2017), we use latent variable modeling techniques to quantify the aleatoric uncertainty (Der Kiureghian and Ditlevsen, 2009; Abdar et al., 2021) of these tokens.
Let us consider that the input token $x$ is generated by an unobserved continuous random variable $z$. We assume that $x_i$ is generated from a conditional distribution $p_\theta(x|z)$, where $z$ is drawn from an isotropic Gaussian prior $p_\theta(z) = \mathcal{N}(z; 0, I)$. To learn the joint distribution of the observed variable $x$ and its latent factors $z$, the objective is to maximize the marginal log-likelihood of $x$ in expectation over the whole distribution of latent factors $z$:

$$\max_\theta \; \mathbb{E}_{p_\theta(z)}\left[\log p_\theta(x|z)\right]. \quad (1)$$

Since masked and unmasked tokens have relatively different noise levels, the functions that quantify the aleatoric uncertainty of these two types should also differ. We take the CUL for masked tokens as an example. Given each masked input token $x_i^m$ and its corresponding context representation $c_i^m$, the true posterior $p_\theta(z^m|x_i^m)$ is approximated by $p_{\theta'}(z^m|c_i^m)$ owing to the distributional similarity hypothesis (Mikolov et al., 2013a). Inspired by Kingma and Welling (2014), we assume $p_{\theta'}(z^m|c_i^m)$ takes an approximate Gaussian form with diagonal covariance, and let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

$$q_\phi(z^m|c_i^m) = \mathcal{N}\!\left(z^m; \mu_i^m, (\sigma_i^m)^2 I\right),$$

where $I$ is the identity matrix and $\phi$ denotes the variational parameters. Both parameters (mean and variance) are input-dependent and predicted by an MLP (a fully-connected neural network with a single hidden layer), i.e., $\mu_i^m = f_{\phi_\mu}(c_i^m)$ and $\sigma_i^m = f_{\phi_\sigma}(c_i^m)$, where $\phi_\mu$ and $\phi_\sigma$ refer to the model parameters w.r.t. the outputs $\mu_i^m$ and $\sigma_i^m$, respectively. Next, we sample a variable $z_i^m$ from the approximate posterior $q_\phi(z^m|c_i^m)$ and feed it into the LM head to predict the original token.
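A hedged PyTorch sketch of this parameterization; predicting the log-variance rather than $\sigma$ directly, and the specific layer sizes and activations, are implementation assumptions on our part, not details given in the paper.

```python
import torch
import torch.nn as nn

class GaussianPosterior(nn.Module):
    """Maps a context vector c_i^m to q_phi(z^m | c_i^m) = N(mu, sigma^2 I)."""
    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        # f_{phi_mu} and f_{phi_sigma}: MLPs with a single hidden layer
        self.f_mu = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                  nn.Linear(hidden_size, latent_size))
        self.f_logvar = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh(),
                                      nn.Linear(hidden_size, latent_size))

    def forward(self, c):
        mu, logvar = self.f_mu(c), self.f_logvar(c)
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)            # epsilon ~ N(0, I)
        z = mu + sigma * eps                     # re-parameterization trick
        return z, mu, logvar
```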
Similarly, the CUL for each unmasked token $x_i^{\bar{m}}$ proceeds in an analogous way and samples a latent variable $z_i^{\bar{m}}$ from the variational approximate posterior $q_\phi(z^{\bar{m}}|c_i^{\bar{m}}) = \mathcal{N}\!\left(z^{\bar{m}}; \mu_i^{\bar{m}}, (\sigma_i^{\bar{m}})^2 I\right)$, where $\mu_i^{\bar{m}}$ and $\sigma_i^{\bar{m}}$ are predicted by MLPs.
In the implementation, we adopt a single $f_{\phi_\mu}$ with shared parameters to obtain $\mu^m$ and $\mu^{\bar{m}}$. Conversely, two $f_{\phi_\sigma}$ with independent parameters are used to obtain $\sigma^m$ and $\sigma^{\bar{m}}$, for $x^m$ with more noise and $x^{\bar{m}}$ with less noise, respectively. After that, batch normalization (Ioffe and Szegedy, 2015) is applied to avoid posterior collapse (Zhu et al., 2020). With the CUL module, the context representation is no longer a deterministic point embedding but a stochastic embedding sampled from $\mathcal{N}(z; \mu, \sigma^2 I)$ in the latent space. Based on the reconstructed representation, the LM head is adopted to predict the original token.
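The asymmetric parameter sharing described above could be wired roughly as follows; applying batch normalization to the predicted means and the single-linear-layer heads are assumptions of this sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class CULHeads(nn.Module):
    """Shared f_{phi_mu}; independent f_{phi_sigma} for masked vs. unmasked tokens."""
    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        self.f_mu = nn.Linear(hidden_size, latent_size)               # shared across both token types
        self.f_logvar_masked = nn.Linear(hidden_size, latent_size)    # x^m: more noise
        self.f_logvar_unmasked = nn.Linear(hidden_size, latent_size)  # x^{bar m}: less noise
        self.bn = nn.BatchNorm1d(latent_size)                         # mitigates posterior collapse

    def forward(self, c, is_masked):
        # c: (num_tokens, hidden_size); is_masked: (num_tokens,) boolean mask
        mu = self.bn(self.f_mu(c))
        logvar = torch.where(is_masked.unsqueeze(-1),
                             self.f_logvar_masked(c),
                             self.f_logvar_unmasked(c))
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```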

Training Objective
To learn a smooth space where latent representations of similar contexts are close to each other and vice versa, the objective function is:

$$\max_{\theta, \phi} \; \mathbb{E}_{z \sim q_\phi(z|c)}\left[\log p_\theta(x|z)\right] \quad \text{s.t.} \quad D_{KL}\!\left(q_\phi(z|c)\,\|\,p_\theta(z)\right) < \delta,$$

where $\delta > 0$ is a constraint and $q_\phi(z|c)$ is the variational approximation of the true posterior $p_\theta(z|x)$ (see Section 2.2). $D_{KL}(\cdot)$ denotes the KL-divergence term, which serves as a regularizer that forces the approximate posterior $q_\phi$ to stay close to the prior distribution $p_\theta(z)$. Then, for each input sequence, the loss function is developed as a weighted sum of the loss functions for masked tokens $\mathcal{L}^m$ and unmasked tokens $\mathcal{L}^{\bar{m}}$:

$$\mathcal{L} = \frac{1}{n^m}\,\mathcal{L}^m + \frac{1}{n^{\bar{m}}}\,\mathcal{L}^{\bar{m}},$$

where the weights are the normalization factors of the masked/unmasked tokens in the current sequence ($n^m$ and $n^{\bar{m}}$ denote their numbers). Each per-type loss takes the form $\mathcal{L}^{*} = \sum_i \big[-\mathbb{E}_{z \sim q_\phi(z|c_i)}[\log p_\theta(x_i|z)] + \lambda^{*}\,D_{KL}(q_\phi(z|c_i)\,\|\,p_\theta(z))\big]$, where $\lambda^m$ and $\lambda^{\bar{m}}$ are trade-off hyper-parameters.
Please see Appendix B for more details. Since the sampling of $z_i$ is a stochastic process, we use the re-parameterization trick (Kingma and Welling, 2014) to make it trainable: $z_i = \mu_i + \sigma_i \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, where $\odot$ denotes the element-wise product. The KL term $D_{KL}(\cdot)$ is then computed in closed form as:

$$D_{KL}\!\left(q_\phi(z|c)\,\|\,p_\theta(z)\right) = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right),$$

where $d$ is the dimension of the latent variable. For all tokens, the CUL forces the model to reconstruct the context using feature regularization specified by prior distributions over the latent variables. Under the objective of VarMAE, latent vectors with similar contexts are encouraged to be smoothly organized together. After pre-training, we use the Transformer encoder and $f_{\phi_\mu}$ to fine-tune on downstream tasks.
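As a rough reading of this objective (not the authors' code), the per-sequence loss could be assembled as below, assuming cross-entropy as the reconstruction term and a simple mean over each token group as the normalization factor; `lam_m` and `lam_u` stand for $\lambda^m$ and $\lambda^{\bar{m}}$.

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    """Closed-form D_KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dims."""
    return -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def varmae_loss(logits, targets, mu, logvar, is_masked, lam_m=10.0, lam_u=10.0):
    """logits: (T, vocab); targets: (T,); mu, logvar: (T, d); is_masked: bool (T,)."""
    recon = F.cross_entropy(logits, targets, reduction="none")   # -log p_theta(x_i | z_i)
    kl = kl_to_standard_normal(mu, logvar)
    lam = torch.where(is_masked, torch.full_like(kl, lam_m), torch.full_like(kl, lam_u))
    per_token = recon + lam * kl
    # normalize separately over masked and unmasked tokens in the sequence
    return per_token[is_masked].mean() + per_token[~is_masked].mean()
```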

Experiments
We conduct experiments on science- and finance-domain NLU tasks to evaluate our method.

Experimental Setup
We compare VarMAE with the following baselines: RoBERTa (Liu et al., 2019) is an optimized BERT with a masked autoencoding objective, and is directly fine-tuned on the given downstream tasks. TAPT (Gururangan et al., 2020) is a continual pre-training model on a task-specific corpus. DAPT (Gururangan et al., 2020) is a continual pre-training model on a domain-specific corpus.

Experiments are conducted with PyTorch, using 2/1 NVIDIA Tesla V100 GPUs with 16GB memory for pre-training/fine-tuning. During pre-training, we use roberta-base and chinese-roberta-wwm-ext to initialize the model for the science (English) and finance (Chinese) domains, respectively. During the pre-training of VarMAE, we freeze the embedding layer and all layers of the Transformer encoder to avoid catastrophic forgetting (French, 1993; Arumae et al., 2020) of previously learned general knowledge, and then optimize the remaining network parameters (e.g., the LM head and the CUL module) using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 5e-5. The number of epochs is set to 3. We use a gradient accumulation step of 50 to achieve a large effective batch size (i.e., 3200). The trade-off coefficient λ is set to 10 for both domains, selected from {1, 10, 100}. For fine-tuning on downstream tasks, most hyper-parameters are the same as in pre-training, except for the following settings due to limited computation: the batch size is set to 128 for OIR and 32 for the other tasks; the maximum sequence length is set to 64 for OIR and 128 for the other tasks; the number of epochs is set to 10. More details are listed in Appendix C.2.
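For concreteness, a sketch of the freezing and optimization setup described above, assuming a Hugging Face style RoBERTa backbone that exposes `.embeddings` and `.encoder` attributes (illustrative names; the actual training script may differ).

```python
import torch

def setup_optimizer(model, lr=5e-5):
    """Freeze the embedding layer and Transformer encoder; train only the new heads."""
    for module in (model.roberta.embeddings, model.roberta.encoder):
        for p in module.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]  # LM head + CUL module
    return torch.optim.Adam(trainable, lr=lr)

# Gradient accumulation over 50 mini-batches to reach an effective batch size of 3200:
#   loss = varmae_loss(...) / 50
#   loss.backward()
#   if (step + 1) % 50 == 0:
#       optimizer.step(); optimizer.zero_grad()
```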

Results and Analysis
Table 1 shows the results on science- and finance-domain downstream tasks. In terms of the average result, VarMAE yields 1.41% and 3.09% absolute performance improvements over the best compared model on the science and finance domains, respectively. This shows the superiority of domain-adaptive pre-training with context uncertainty learning. DAPT and TAPT obtain inferior results, indicating that the small domain corpus limits continual pre-training due to the distribution shift.
Table 2 reports the average results on all tasks against different pre-training corpus sizes (see Appendix D.1 for details). VarMAE consistently achieves better performance than DAPT even when only a third of the corpus is used. When the full corpus is used, DAPT's performance decreases while VarMAE's performance increases, which demonstrates that our method can adapt to the target domain with a limited corpus.
Table 3 shows the average results of VarMAE on all tasks against different masking ratios during pre-training (see Appendix D.2 for details). Under the default masking strategy (80% of the selected tokens are replaced with the [MASK] symbol, 10% are kept as-is, and 10% are replaced with a random token), the best masking ratio is 15%, the same as in BERT and RoBERTa.

Case Study
As shown in Table 4, we randomly choose several samples from the test set in the multi-label topic classification (MTC) task.
For the first case, RoBERTa and DAPT each predict one label correctly. This shows that both general and domain language knowledge have a certain effect on the domain-specific task. However, neither of them identifies all the tags completely, which reflects that a general PLM, or one continually pre-trained on a limited corpus, is not sufficient for the domain-specific task. For the second and third cases, these two compared methods cannot classify the topic labels Risk education and Critical illness, respectively, indicating that they perform an isolated point estimation and learn relatively poor context representations. Unlike these methods, VarMAE encodes each token's context into a smooth latent distribution and produces diverse and well-formed contextual representations. As expected, VarMAE predicts the first three examples correctly with limited resources.
For the last case, all methods fail to predict Critical illness. We notice that ABC Comprehensive Care Program is a product name related to critical illness insurance. Classifying it properly may require some domain-specific structured knowledge.

Conclusion
We propose a novel Transformer-based language model named VarMAE for domain-adaptive language understanding with limited resources. A new CUL module is designed to produce diverse and well-formed context representations. Experiments on science- and finance-domain tasks demonstrate that VarMAE can be efficiently adapted to new domains using a limited corpus. We hope that VarMAE can guide future foundational work in this area.

Limitations
All experiments are conducted on a small pre-training corpus due to the limitation of computational resources. The performance of VarMAE when pre-training on a larger corpus needs to be further studied. Besides, VarMAE cannot be directly adapted to downstream natural language generation tasks, since our model does not contain a decoder for generation. This is left as future work.

A Related Work
A.1 General PLMs
Traditional works (Mikolov et al., 2013b; Pennington et al., 2014) represent each word as a single vector, which cannot disambiguate word senses based on the surrounding context. Recently, unsupervised pre-training on large-scale corpora has significantly improved performance, both for Natural Language Understanding (NLU) (Peters et al., 2018; Devlin et al., 2019; Cui et al., 2021) and for Natural Language Generation (NLG) (Raffel et al., 2020; Brown et al., 2020; Lewis et al., 2020). Following this trend, considerable progress (Liu et al., 2019; Yang et al., 2019; Clark et al., 2020; Joshi et al., 2020; Wang et al., 2020; Diao et al., 2020) has been made to boost performance by improving model architectures or exploring novel pre-training tasks. Some works (Sun et al., 2019; Zhang et al., 2019; Qin et al., 2021) enhance the model by integrating structured knowledge from external knowledge graphs.
Due to the flexibility of natural language, one word may have different meanings in different domains. The above methods underperform when migrated to specialized domains. Moreover, simple fine-tuning (Howard and Ruder, 2018; Hu and Wei, 2020; Wei et al., 2020; Hu et al., 2022) of PLMs is also not sufficient for domain-specific applications.
Remarkably, Beltagy et al. (2019) and Chalkidis et al. (2020) explore different strategies to adapt to new domains, including pre-training from scratch and further pre-training. Boukkouri et al. (2022) find that both perform at a similar level when pre-training on a specialized corpus, but the former requires more resources. Yao et al. (2022) jointly optimize the task and language modeling objectives from scratch. Zhang et al. (2020), Tai et al. (2020), and Yao et al. (2021) extend the vocabulary of the LM with domain-specific terms for further gains. Gururangan et al. (2020) show that domain- and task-adaptive pre-training methods can offer gains in specific domains. Qin et al. (2022) present an efficient lifelong pre-training method for emerging domain data.
In most specific domains, collecting large-scale corpora is usually infeasible. The limited data makes pre-training from scratch impractical and restricts the performance of continual pre-training. To address this issue, we investigate domain-adaptive language understanding with a limited target corpus and propose a novel language modeling method named VarMAE. The method employs a context uncertainty learning module to produce diverse and well-formed contextual representations, and can be efficiently adapted to new domains with limited resources.

B Derivation of Objective Function
Here, we take the objective for masked tokens as an example to derive the loss function; the objective for unmasked tokens is analogous. To simplify the description, we omit the superscripts that distinguish masked tokens from unmasked tokens. To learn a smooth space of masked tokens where latent representations of similar contexts are close to each other and vice versa, the objective function is:

$$\max_{\theta, \phi} \; \mathbb{E}_{z \sim q_\phi(z|c)}\left[\log p_\theta(x|z)\right] \quad \text{s.t.} \quad D_{KL}\!\left(q_\phi(z|c)\,\|\,p_\theta(z)\right) < \delta,$$

where $\delta > 0$ is a constraint and $q_\phi(z|c)$ is the variational approximation of the true posterior $p_\theta(z|x)$ (see Section 2.2). $D_{KL}(\cdot)$ denotes the KL-divergence term, which serves as a regularizer that forces the approximate posterior $q_\phi$ to stay close to the prior distribution $p_\theta(z)$. To encourage this disentangling property in the inferred posterior (Higgins et al., 2017), we introduce a constraint $\delta$ over $q_\phi(z|c)$ by matching it to the prior $p_\theta(z)$. The objective can be computed as a Lagrangian under the KKT conditions (Bertsekas, 1997; Karush, 2014). The above optimization problem with a single inequality constraint is equivalent to maximizing the following equation:

$$\mathcal{F}(\theta, \phi, \lambda; c, z) = \mathbb{E}_{z \sim q_\phi(z|c)}\left[\log p_\theta(x|z)\right] - \lambda\left(D_{KL}\!\left(q_\phi(z|c)\,\|\,p_\theta(z)\right) - \delta\right), \quad (7)$$

where $\lambda$ is the Lagrangian multiplier (the trade-off coefficient in Section 2.3). Since $\lambda, \delta > 0$, maximizing $\mathcal{F}$ is lower-bounded by maximizing $\mathbb{E}_{z \sim q_\phi(z|c)}[\log p_\theta(x|z)] - \lambda\, D_{KL}(q_\phi(z|c)\,\|\,p_\theta(z))$, whose negative gives the per-token loss used in Section 2.3.

Table 1 :
Results on science- and finance-domain downstream tasks. All compared pre-trained models are fine-tuned on the task datasets. For each dataset, we run three random seeds and report the average result on the test set. We report the micro-average F1 score for CLS and TM, the entity-level F1 score for NER, and the token-level F1 score for SE. The best results are highlighted in bold.

Table 2 :
Average results on all downstream tasks against different pre-training corpus sizes. |D| is the corpus size for the corresponding domain.

Table 3 :
Average results of VarMAE on all downstream tasks against different masking ratios of pre-training.

Table 4 :
Case studies in the multi-label topic classification (MTC) task of financial business scenarios. The table shows four examples of spoken dialogues in the test set, their gold labels, and the predictions of three methods (RoBERTa, DAPT, and VarMAE). We translate the original Chinese into English for readers.