Unsupervised Energy-based Adversarial Domain Adaptation for Cross-domain Text Classification

Transferring knowledge from a label-rich domain (source domain) to a label-scarce domain (target domain) for pervasive cross-domain Text Classification (TC) is a non-trivial task. To overcome this issue, we propose EADA, a novel unsupervised energy-based adversarial domain adaptation framework. First, a deep pre-trained language model (e.g., RoBERTa) is leveraged as a shared feature extractor that maps the text sequences from both source and target domains to a feature space. Since the source features maintain good feature discriminability owing to fully supervised training, we design a method that encourages the target features towards the source ones via adversarial learning. An autoencoder is designed as an energy function that focuses on reconstructing the source feature embeddings, while the feature extractor aims to generate source-like target feature embeddings to deceive the autoencoder. In this manner, the target feature embeddings become domain-invariant and inherit great discriminability. Extensive experiments on multi-domain sentiment classification (Amazon review dataset) and Yes/No question-answering classification (BoolQ and MARCO datasets) are conducted. The experimental results validate that EADA largely alleviates the domain discrepancy while maintaining excellent discriminability, and achieves state-of-the-art cross-domain TC performance.


Introduction
With the booming development of Natural Language Processing (NLP) in recent years, text classification (TC) plays a vital role in a myriad of services in our daily lives, such as online recommendation, email spam detection, sentiment classification, and social media analysis. Large pre-trained language models, e.g., BERT (Devlin et al., 2019), XLNet, and RoBERTa (Liu et al., 2019b), achieve outstanding results on challenging NLP benchmarks, e.g., GLUE (Wang et al., 2018), RACE (Lai et al., 2017), and SQuAD (Rajpurkar et al., 2016). These models enable numerous downstream NLP tasks with compelling performance, including TC, where the model is further fine-tuned with annotated data.

TC tasks are usually domain-dependent in the real world. Thus, the performance of these powerful deep models still fluctuates and even degrades when they are directly applied in an unseen domain (target domain), where the task topic or the data distribution differs from the domain seen during training (source domain). Although their performance can be improved via fine-tuning with full supervision in the target domain, a significant amount of labeled target data is required. Collecting high-quality data is usually difficult and expensive in many real-world domains. Furthermore, the annotation process is extremely time-consuming and labor-intensive. To overcome these issues, unsupervised domain adaptation (UDA), which aims to transfer knowledge from a label-rich domain (source domain) to a label-scarce or unlabeled domain (target domain), has been proposed (Li et al., 2017; Guo et al., 2018). The intuitive objective of UDA is to align the marginal distribution of features across the source and target domains. In general, UDA methods can be classified into two categories. One line of research focuses on reducing the discrepancy by minimizing statistical measurements, e.g., maximum mean discrepancy (Tzeng et al., 2014a).
Another category leverages adversarial learning to alleviate the domain shift. Motivated by the Generative Adversarial Network (GAN) (Goodfellow et al., 2014), adversarial domain adaptation (ADA) introduces a binary domain discriminator to identify the domain label of the data, while an encoder learns to fool the discriminator. ADA has achieved encouraging results on non-trivial DA problems across various applications, such as image classification (Vu et al., 2019; Yang et al., 2020b), human activity recognition (Zou et al., 2019), the Internet of Things (Yang et al., 2020a), and also text classification (Li et al., 2017). For instance, AMN (Li et al., 2017) trains a sentiment classifier and a domain discriminator to reduce the domain discrepancy. ADAN exploits adversarial learning for cross-lingual sentiment classification. HAGAN integrates the hierarchical attention mechanism with ADA to obtain features that are sentiment-distinguishable but domain-indistinguishable.
Although these ADA methods achieve good results in certain cross-domain TC tasks, one major issue is the unstable prediction performance in the target domain (Saito et al., 2018). After the adversarial training converges, the conventional binary domain discriminator cannot distinguish the domain label of the feature representations, which means these representations obtain good transferability. However, there is no constraint on the discriminability in the target domain. The model can generate trivial but useless target feature representations as long as they can fool the domain discriminator. Thus, this uncertainty in adversarial training deteriorates the discriminability of the target feature representations and ignores the decision boundary learned in the source domain, which leads to unstable and even poor prediction performance in the target domain (Chen et al., 2019a; Cui et al., 2020). Some works aim to adjust the decision boundary of the label classifier (Saito et al., 2018; Shu et al., 2018) or align additional semantic information to overcome this issue during adversarial training. However, these additional learning steps either require a sophisticated hyper-parameter tuning process or increase the computational overhead, which limits the generalization capability of ADA methods for NLP tasks. Therefore, a simple yet efficient solution is urgently desired.
In this paper, we propose EADA, an energy-based adversarial domain adaptation framework that tackles the uncertainty issue during adversarial learning and is dedicated to text classification tasks. EADA consists of three modules: a shared feature extractor, a label predictor, and an autoencoder. We employ a deep pre-trained language model (RoBERTa) as the shared feature extractor that maps the text sequences from both the source domain and the target domain into a latent feature space. With the labeled source data, the feature extractor and the label predictor are fine-tuned under full supervision. Since the source feature representations generated by the feature extractor possess superb discriminability, the innovative goal of EADA is to fix these source features by adding constraints to the objective and to force only the target feature distribution to align with the source feature distribution through adversarial training. In this way, the target features remain discriminative, and the label predictor also performs well in the target domain. Since an autoencoder is acknowledged as an energy function that learns to map observed samples to a low-energy space (LeCun et al., 2006), we design an autoencoder that leverages this property: it fixes the source features by associating lower energies to them, while pushing the target domain towards the low-energy space by minimizing the margin loss of the autoencoder. Meanwhile, it can also cluster similar data into a high-density manifold, which helps to preserve more semantic information. We train the autoencoder to reconstruct the source features and train the feature extractor to generate source-like target features to deceive the autoencoder via a minimax game with a margin loss.
In summary, we make the following contributions: • To address the problem that the conventional binary domain discriminator deteriorates the discriminability of the target feature representation, we propose a novel autoencoder module, which forces the target feature representations to simulate the source feature representations such that good discriminability can be inherited.
• As an energy function, the autoencoder maps features from both domains to the low-energy space, which encourages the feature clusters to be tight in an unsupervised manner. This improves the label classification accuracy in the target domain.
• Extensive experiments on public cross-domain TC benchmark datasets, including multi-domain sentiment classification (Amazon review dataset) and cross-domain Yes/No question-answering (QA) classification (BoolQ and MARCO datasets), are conducted. The experimental results demonstrate that EADA alleviates the uncertainty during adversarial training and enhances the feature discriminability in the target domain. This enables EADA to outperform existing methods and achieve new state-of-the-art ADA results for cross-domain TC tasks without requiring any labeled data in the target domain.
The rest of the paper is organized as follows. Section 2 summarizes the existing domain adaptation methods for TC tasks. The limitation of existing ADA methods is elaborated in Section 3. Section 4 presents the framework architecture of EADA. In Section 5, we present the experimental results and performance evaluation. We conclude our work in Section 6.

Related Work
Domain Adaptation aims to tackle the domain shift issue when the data distributions in the source domain and the target domain are different (Ben-David et al., 2010). Unsupervised domain adaptation (UDA) aims to learn a model that achieves good classification accuracy without any annotation in the target domain (Tzeng et al., 2017; Zhao et al., 2019). Certain statistical measurements, such as maximum mean discrepancy (MMD) (Tzeng et al., 2014b; Ma et al., 2019), are leveraged to quantify the distribution differences. Inspired by the recent success of the Generative Adversarial Network (GAN) (Goodfellow et al., 2014) for data generation, researchers have proposed adversarial domain adaptation (ADA), which constructs an adversarial loss to accommodate the domain shift. It consists of an encoder and a domain discriminator: the encoder aims to fool the discriminator by making the target-domain samples look like the source-domain ones, while the discriminator tries to identify the domain labels (source or target). ADDA (Tzeng et al., 2017) learns a discriminative representation using the labels in the source domain and then a separate encoder that maps the target data to the same space using an asymmetric mapping learned through a standard GAN loss without weight sharing. CoGAN (Liu and Tuzel, 2016) trains two GANs to synthesize both source and target images and achieves a domain-invariant feature space by tying the high-level layer parameters of the two GANs to solve the domain transfer problem.
ADA has been adopted for cross-domain NLP tasks as well (Peng et al., 2018; Li et al., 2017; Cai and Wan, 2019). AMN (Li et al., 2017) is an end-to-end adversarial memory network for cross-domain sentiment classification and the pioneering work for ADA in NLP. An adversarial deep averaging network has been proposed for cross-lingual sentiment classification, and a dedicated ADA framework has been developed for machine reading comprehension. Another ADA model learns domain-invariant representations across multiple domains for text classification. Target domain-specific information is exploited in (Peng et al., 2018) to further improve the DA performance, although labeled data in the target domain is required.
Large deep pre-trained language models, pioneered by BERT (Devlin et al., 2019), have been employed as feature encoders to embed text sequences into a latent feature space. The encoder is then further fine-tuned with the discriminator via adversarial learning using the labeled source data and the unlabeled target data (Ma et al., 2019). For instance, BERT and ADA were combined for domain-agnostic question-answering, and a similar framework that integrates BERT and MMD is proposed in (Ma et al., 2019) for cross-domain sentiment classification. However, all these approaches leverage the binary domain discriminator, which fails to consider the discriminative features during feature learning. This leads to severe performance degradation, since the decision boundary of the label predictor trained with source data is no longer valid in the target domain due to the domain shift.

Limitation of Existing ADA Methods
In this section, we analyze the learning process of conventional ADA methods and reveal their limitations. In the common UDA setup, the distributions of the source domain D_S and the target domain D_T are different due to the domain discrepancy. UDA aims to build a model that provides good class prediction in both the source and the target domain. Discriminability of a feature representation refers to its clustering capacity in the feature manifold, which controls how easily the class categories can be separated. Excellent discriminability can be achieved for the source features thanks to the fully supervised learning in the source domain. The objective of UDA is to transfer this discriminability and ensure the model maintains it in the target domain.

Figure 1: EADA consists of a pre-trained language model as a shared feature extractor G_f, a label predictor G_y, and an autoencoder G_a. In addition to the fully supervised learning of G_f and G_y with the labeled source data, the autoencoder G_a serves as a domain classifier that learns to reconstruct the source feature representations while pushing the target feature representations away. The feature extractor G_f aims to generate source-like target feature representations to deceive the autoencoder. This objective is realized by forcing the target feature representations towards those from the source domain in the feature space via adversarial learning.
The ADA method, as one category of UDA, pioneered by Domain-Adversarial Training of Neural Networks (DANN) (Ganin et al., 2016) and the Adversarial Memory Network (AMN) (Li et al., 2017), has shown promising performance in numerous NLP tasks in recent years (Cai and Wan, 2019). It usually consists of a shared feature extractor f = G_f(x), a label predictor y = G_y(x), and a domain discriminator d = G_d(x). In addition to the standard fully supervised learning process in the source domain, a minimax game is designed between f and d. The domain discriminator d aims to distinguish the domain label between source and target, while the feature extractor f is trained to deceive d. This adversarial training process can be formulated as

min_{θ_f, θ_y} max_{θ_d} L_y(X_s, Y_s) − γ L_d(X_s, X_t),   (1)

where L_y is the cross-entropy classification loss and L_d is the binary domain-classification loss of the discriminator. In this manner, the model can learn domain-invariant features and transfer them across domains when the Nash Equilibrium is achieved (Zhao et al., 2017). The hyper-parameter γ controls the significance of adversarial training that improves transferability. As shown in Eq (1), the feature extractor f of conventional ADA methods is trained to achieve two tasks: (1) learn source representations with good discriminability; (2) produce representations that are indistinguishable to the domain discriminator d. Since data from both the source domain and the target domain are involved in the adversarial feature learning, as presented in the second term of Eq (1), the objective is equivalent to moving the two domains closer in the feature space to deceive d. However, this process does not impose any constraint on the discriminability in the target domain. The feature extractor f can generate trivial but useless target representations as long as they can fool the discriminator d. Therefore, these ADA methods cannot guarantee that the good decision boundary learned via full supervision in the source domain can still separate the categorical clusters in the target domain (Liu et al., 2019a).
This degradation of discriminability in the target domain is the major reason that hinders the performance of existing ADA methods.
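The conventional ADA objective of Eq (1) can be sketched in PyTorch with a binary domain discriminator and a DANN-style gradient-reversal layer. The module shapes and the gradient-reversal implementation below are illustrative sketches, not the exact code of any cited method.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient-reversal layer: identity on the forward pass, scaled
    negated gradient on the backward pass, as used in DANN-style ADA."""

    @staticmethod
    def forward(ctx, x, gamma):
        ctx.gamma = gamma
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing into the feature extractor.
        return -ctx.gamma * grad_output, None


def ada_step(G_f, G_y, G_d, x_s, y_s, x_t, gamma=0.01):
    """One conventional ADA loss: source classification loss plus a
    domain-classification loss on both domains, with the adversarial
    sign handled by gradient reversal (the minimax of Eq (1))."""
    z_s, z_t = G_f(x_s), G_f(x_t)
    # Supervised cross-entropy L_y on labeled source data.
    loss_y = nn.functional.cross_entropy(G_y(z_s), y_s)
    # Domain loss L_d: label 0 = source, 1 = target.
    z_all = torch.cat([z_s, z_t], dim=0)
    d_labels = torch.cat([torch.zeros(len(z_s)), torch.ones(len(z_t))]).long()
    d_logits = G_d(GradReverse.apply(z_all, gamma))
    loss_d = nn.functional.cross_entropy(d_logits, d_labels)
    return loss_y + loss_d
```

Note that nothing in this objective constrains where the target embeddings z_t end up, which is precisely the limitation discussed above.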

Energy-based Adversarial Domain Adaptation
It is not a trivial task to maintain the source manifolds during adversarial training. Our solution is to decouple the adversarial training of the source and target feature representations. Specifically, we fix the source representations in the feature space and only encourage the target representations to align with them. Therefore, the superb discriminability learned in the source domain can be preserved, and a label predictor that performs well in both the source and target domains can be obtained.
To achieve this goal, we propose Energy-based Adversarial Domain Adaptation (EADA), which innovatively utilizes an autoencoder structure as the domain discriminator during adversarial training. Figure 1 shows the model structure of EADA. It consists of three modules: a pre-trained language model as a shared feature extractor G_f, parameterized by θ_f, that embeds an input sample into a feature embedding z; a label predictor G_y, parameterized by θ_y, that consists of several fully connected layers and maps the feature embedding z to the predicted label ŷ; and an autoencoder G_a, parameterized by θ_a, that reconstructs a feature embedding z into ẑ. The detailed functionality of each module is elaborated as follows.

Shared Feature Extractor
Large pre-trained language models (e.g., BERT (Devlin et al., 2019), XLNet, and RoBERTa (Liu et al., 2019b)) have achieved a series of state-of-the-art results on NLP benchmarks. These powerful pre-trained language models are built upon the bidirectional transformer architecture and pre-trained on large corpora with a masked language model objective, which enables various downstream NLP tasks, including text classification.
In this work, we employ RoBERTa as the shared feature extractor G_f (highlighted in blue in Figure 1) that embeds both the labeled text data (X_s) from the source domain and the unlabeled text data (X_t) from the target domain into a latent feature space. For the QA classification problem, the input to G_f is a sequence pair <query Q, passage P>, as depicted in Figure 1, in the format [CLS] <s> Q </s> <s> P </s>, where [CLS] is a dummy token for classification and <s> </s> are separator tokens. We leverage the roberta.base architecture (12-layer, 768-hidden, 12-heads, 125M parameters) (Liu et al., 2019b) as the shared feature extractor G_f. Since our objective is text classification, the last hidden representation of the [CLS] token, H_[CLS] ∈ R^(768×1), serves as the output of G_f, i.e., the feature embedding z. These embeddings z are utilized by both the classifier G_y and the autoencoder G_a.
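The feature-extraction step reduces to taking position 0 of the encoder's last hidden states. The helper below is an illustrative sketch that works with any encoder returning a (batch, seq_len, 768) tensor, e.g. the `last_hidden_state` of a Hugging Face `RobertaModel`.

```python
import torch


def extract_cls_feature(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Return the feature embedding z = H_[CLS]: the final hidden state
    of the [CLS] token, which sits at position 0 of the input sequence.

    last_hidden_state: (batch, seq_len, 768) tensor from the encoder.
    """
    return last_hidden_state[:, 0, :]  # shape (batch, 768)
```

With the Hugging Face transformers library, this would be applied to the `last_hidden_state` output of `RobertaModel.from_pretrained("roberta-base")`.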

Class Label Predictor
The class label predictor G_y consists of several fully connected layers that map the feature embedding z to the predicted label ŷ. Since the source domain is label-rich by default, we assume that n labeled samples D_S = {x^s_i, y^s_i} are available from the source domain for fine-tuning the shared feature extractor (language model) G_f (blue part in Figure 1) and the label predictor G_y (green part in Figure 1). Good classification accuracy of G_y is achieved by minimizing the cross-entropy loss via back-propagation under full supervision:

L_y = −(1/n) Σ_{i=1}^{n} Σ_c y^s_{i,c} log ŷ^s_{i,c},   (3)

where y^s_{i,c} is the one-hot ground-truth label and ŷ^s_{i,c} is the predicted probability of sample i for class c.
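A minimal sketch of G_y and the supervised loss of Eq (3) follows. The hidden width and depth are assumptions for illustration, since the paper only specifies "several fully connected layers".

```python
import torch
import torch.nn as nn


class LabelPredictor(nn.Module):
    """G_y: an MLP head mapping the 768-d [CLS] embedding z to class
    logits. Layer sizes here are illustrative, not the paper's exact
    configuration."""

    def __init__(self, in_dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z):
        return self.net(z)


def source_loss(G_y, z_s, y_s):
    """Source-domain supervision (Eq (3)): standard cross-entropy over
    the predicted logits for labeled source embeddings."""
    return nn.functional.cross_entropy(G_y(z_s), y_s)
```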

Autoencoder as Domain Discriminator
After obtaining the source feature representations with good discriminability, the next task is to learn transferable features with k unlabeled samples from the target domain D_T = {x^t_i}. To ensure both transferability and discriminability of the feature representation, we design an autoencoder G_a with a margin Mean Squared Error (MSE) loss to replace the conventional binary domain discriminator. The MSE loss of the autoencoder on a set of samples X is defined as

L_AE(X) = (1/|X|) Σ_{x∈X} ||G_a(G_f(x)) − G_f(x)||_2^2,   (4)

where || · ||_2^2 denotes the squared L2-norm. Since the source embeddings z_s always contain superb discriminability due to the full supervision during the training of the classifier, z_s should be fixed to preserve the good decision boundary, while the target embeddings z_t should be encouraged to align with the distribution of z_s. To achieve this goal, the autoencoder G_a is designed to reconstruct only features from the source domain but not features from the target domain; consequently, the autoencoder incurs the same reconstruction loss in both domains only when the two domains are distributed similarly. The training process of the autoencoder is formulated as

min_{θ_a} L_AE(X_s) + max(0, m − L_AE(X_t)),   (5)

where m is the margin between the representations from the source domain and the target domain. The autoencoder G_a can be considered as an energy function that associates lower energies to the observed samples in a binary classification problem (LeCun et al., 2006). EADA is inspired by the Energy-based GAN, which theoretically proves that when an energy function is used as the discriminator, the generator can simulate the true distribution at the Nash Equilibrium (Zhao et al., 2017). In EADA, the autoencoder module G_a provides a similar functionality: it associates low energies to the source features, i.e., it focuses on reconstructing the source embeddings z_s. As presented in Eq (5), the training goal of the autoencoder is to reach L_AE(X_s) = 0 and L_AE(X_t) = m. It thus behaves proportionally similar to a binary domain discriminator.
However, G_a encodes more domain information than a binary discriminator and can transfer it during adversarial training.
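The autoencoder-as-discriminator can be sketched as follows. The layer sizes (768-384-96-384-768) follow the configuration reported in the experiments; the ReLU activations are an assumption, as the paper does not specify them.

```python
import torch
import torch.nn as nn


class AutoencoderDiscriminator(nn.Module):
    """G_a: the energy function. A five-layer MLP autoencoder with the
    768-384-96-384-768 shape used in the experiments; activations are
    assumed, not specified by the paper."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(768, 384), nn.ReLU(),
            nn.Linear(384, 96), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(96, 384), nn.ReLU(),
            nn.Linear(384, 768),
        )

    def forward(self, z):
        return self.decoder(self.encoder(z))


def recon_energy(G_a, z):
    """L_AE: mean squared reconstruction error of the embeddings,
    interpreted as the energy assigned to them."""
    return nn.functional.mse_loss(G_a(z), z)


def discriminator_loss(G_a, z_s, z_t, m=4.0):
    """Autoencoder objective (Eq (5)): reconstruct source embeddings
    while pushing the target energy up to the margin m. Embeddings are
    detached so only G_a is updated by this loss."""
    e_s = recon_energy(G_a, z_s.detach())
    e_t = recon_energy(G_a, z_t.detach())
    return e_s + torch.clamp(m - e_t, min=0.0)
```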

The Learning Framework
The adversarial training objective of the three modules forms a minimax game, defined by

min_{θ_f, θ_y} max_{θ_a} L_y(X_s, Y_s) + γ [L_AE(X_t) − L_AE(X_s)],   (6)

where γ is a hyper-parameter that controls the effectiveness of G_a, and the inner maximization over θ_a is clipped by the margin m as in Eq (5). The shared feature extractor G_f maps both the labeled source data X_s and the unlabeled target data X_t to a latent feature space. Both G_f and the label predictor G_y are trained with full supervision using the labeled data in the source domain. Another key role of the feature extractor G_f is to deceive the autoencoder G_a by generating source-like features for the unlabeled target samples. Therefore, we only incorporate the L_AE(X_t) term into the training of G_f. The adversarial training of G_f is formulated as

min_{θ_f, θ_y} L_y(X_s, Y_s) + γ L_AE(X_t).   (7)

In the minimax game, the autoencoder G_a aims to maximize the domain divergence by pushing the two domains apart by a margin m, while the objective of the feature extractor G_f is to minimize the domain divergence by deceiving the autoencoder. When the model converges, the target feature representations inherit the excellent discriminability of the source domain, so that the generalization capability of the label predictor G_y is improved and G_y performs well not only in the source domain but also in the target domain.
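Putting the pieces together, one alternating update of the minimax game in Eq (6) might look like the sketch below. The detach/update schedule (update G_a first, then G_f and G_y) is an assumption, since the paper does not specify the exact optimization order.

```python
import torch
import torch.nn as nn

F = nn.functional


def eada_step(G_f, G_y, G_a, opt_fy, opt_a, x_s, y_s, x_t,
              gamma=0.01, m=4.0):
    """One alternating EADA update, a sketch of the minimax in Eq (6).
    energy(z) is the MSE reconstruction loss of the autoencoder G_a."""
    energy = lambda z: F.mse_loss(G_a(z), z)

    # 1) G_a: pull source energy down, push target energy up to the
    #    margin m (Eq (5)); features are detached so G_f is untouched.
    z_s, z_t = G_f(x_s).detach(), G_f(x_t).detach()
    loss_a = energy(z_s) + torch.clamp(m - energy(z_t), min=0.0)
    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    # 2) G_f, G_y: source cross-entropy plus gamma times the target
    #    reconstruction energy, so G_f learns source-like target
    #    embeddings that fool G_a.
    z_s, z_t = G_f(x_s), G_f(x_t)
    loss_fy = F.cross_entropy(G_y(z_s), y_s) + gamma * energy(z_t)
    opt_fy.zero_grad()
    loss_fy.backward()
    opt_fy.step()

    return loss_a.item(), loss_fy.item()
```

Here `opt_a` optimizes only G_a's parameters and `opt_fy` only those of G_f and G_y, so each backward pass updates the intended player of the minimax game.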

Experiments
We evaluate the domain adaptation performance of EADA on two public real-world cross-domain text classification benchmarks: 1) sentiment classification (Amazon reviews dataset); 2) natural Yes/No QA classification (BoolQ ⇔ MS Marco), and compare it with state-of-the-art baselines.

Evaluation on Sentiment Classification
The Amazon reviews dataset (Pan et al., 2010) is the standard and well-known benchmark for sentiment classification domain adaptation. It contains reviews from four domains: Books (B), DVDs (D), Electronics (E), and Kitchen (K). Each domain contains 1000 positive reviews (higher than 3 stars) and 1000 negative reviews (3 stars or lower). Following (Li et al., 2017), we construct 12 cross-domain sentiment classification tasks: D→B, E→B, K→B, K→E, D→E, B→E, B→D, K→D, E→D, B→K, D→K, and E→K, where the letter before the arrow represents the source domain and the letter after the arrow indicates the target domain. For each domain adaptation pair, 800 labeled positive and 800 labeled negative reviews from the source domain, together with 1600 unlabeled reviews from the target domain, are randomly selected for training. The remaining 200 positive and 200 negative reviews from the target domain are used for testing. We configured the feature extractor module G_f as roberta.base for the single-sequence task, since each review is a single passage. The input is tokenized as [CLS] <s> Review </s>. The maximum input sequence length is set to 256 tokens. The autoencoder module G_a consists of 5 fully connected layers (768-384-96-384-768). The entire EADA framework is implemented in PyTorch. We adopted the Adam optimizer with a constant learning rate of 1e-5 and a batch size of 24, and used 5-fold cross-validation to tune the hyperparameters to m = 4 and γ = 0.01 during training. Table 1 presents the mean accuracy over 5 runs of each method on the 12 DA tasks. One observation is that the accuracy of source-only RoBERTa (82.94%) is even worse than that of MoE (83.74%). This indicates that the DA problem cannot be solved by solely using large pre-trained language models. It can be observed that EADA achieves 86.44% classification accuracy on average, which outperforms all the baselines. It achieves the best DA performance in 10 of the 12 tasks.
Although ADA-RoBERTa and EADA both adopt RoBERTa as the feature extractor, the accuracy of EADA is still 2.5% higher than that of ADA-RoBERTa, which validates the advantage of the proposed energy-based ADA method. Moreover, the variance of EADA is the smallest among all the methods, which indicates that its performance is more stable in general. By learning source-like representations for the target feature embeddings, EADA successfully performs cross-domain sentiment classification without any annotated data in the target domain.

Evaluation on Yes/No QA classification
We also validated the performance of EADA on cross-domain naturally occurring yes/no questions between the BoolQ dataset (Clark et al., 2019) and the Marco dataset (Nguyen et al., 2016). Each example is a triplet of (query, passage, answer). Thus, the feature extractor module G_f (RoBERTa) is configured for the sequence-pair task, which means the input is tokenized in the format [CLS] <s> query </s> <s> passage </s>. The BoolQ dataset contains 5874 Yes and 3553 No samples for training and 2033 Yes and 1237 No samples for evaluation, drawn from Wikipedia. Samples in the Marco dataset are web snippets from Bing Search. There are 17339 Yes and 10550 No samples for training and 2033 Yes and 1237 No samples for testing. The data distributions at both the domain level and the category level are imbalanced, which is a common situation in many real-world applications. The number of training samples in BoolQ is only 33.8% of that in Marco, indicating that the data are imbalanced across domains. Moreover, the categorical distribution is imbalanced because the number of No samples is at least 39.1% smaller than the number of Yes samples in both domains. Since the samples from the two domains are collected from different sources, there is a huge domain shift between the two datasets.
The domain adaptation performance is reported in Table 2. The 2nd column shows the accuracies when the non-adapted source feature extractors and classifiers are directly applied in the target domain, which serves as the lower baseline. The last column reports the accuracies when the feature extractors and classifiers are trained with full supervision, i.e., all target training data are labeled (the upper baseline). As shown in Table 2, the source-only classifiers can only provide 68% accuracy, which verifies that the domain shift hurts the classification accuracy even when a powerful deep language model is adopted. On the other hand, it can be easily observed from Table 2 that EADA enhances the accuracy in both adaptation directions by at least 15% compared to the lower baseline in an unsupervised manner, and outperforms all the state-of-the-art baselines. It also elevates the performance closer to that of target full supervision.
We leverage t-Distributed Stochastic Neighbor Embedding (t-SNE) to map the embedded feature representations produced by different feature extractors to a 2-D space for better visualization and analysis. Figure 2(a) and Figure 2(b) depict the embedded features using the non-adapted source feature extractor and the feature extractor learned by EADA, respectively (Yes samples in red, No samples in green). When the non-adapted source feature extractor is directly applied in the target domain, as shown in Figure 2(a), a large number of samples with different categorical labels overlap with each other, which leads to the substantial misclassification reported in the 2nd column of Table 2. After employing EADA, the commonly confused samples are further separated in the latent feature space and two clusters are formed, as depicted in Figure 2(b). These observations further validate that the feature embeddings constructed via EADA are not only domain-invariant but also preserve excellent discriminability in both the source domain and the target domain.
We also conducted a sensitivity study of the two hyperparameters, m and γ in Eq (6). We evaluated the impact of the margin m with values from 0 to 10 in both experiments. EADA's accuracy increases as m increases. The reason is that the degree of transferability is limited when m is small. The performance becomes stable when m ≥ 4 in both experiments. In general, γ controls the weight of the adversarial loss during feature learning in ADA, as shown in Eq (1); in EADA, γ controls the weight of the autoencoder reconstruction loss on target-domain samples during feature learning, as presented in Eq (6). We evaluated the impact of γ with values from 0 to 1 for EADA and ADA-RoBERTa. As γ increases, the accuracy of EADA increases and becomes stable when γ ≥ 0.01. The accuracy of ADA-RoBERTa fluctuates as γ increases and drops when γ ≥ 0.5. Thus, EADA provides a more stable training procedure than conventional ADA methods, which makes it easier to generalize. We recommend m = 4 and γ = 0.01 as the default setup for other tasks.

Conclusion
In this paper, we proposed EADA, a novel unsupervised energy-based adversarial domain adaptation method for cross-domain text classification tasks. First, a deep pre-trained language model is leveraged as a shared feature extractor to map the text sequences from both source and target domains to a feature space. The feature extractor and a label predictor are trained with labeled source data. Since the source feature representations are obtained under full supervision, they preserve great feature discriminability. To ensure that the label predictor also provides good label prediction in the target domain, the target feature representations should be encouraged to align with the source during adversarial training. Thus, we designed an autoencoder that focuses on reconstructing the source feature representations, while the feature extractor aims to generate source-like target feature embeddings to fool the autoencoder. Extensive experiments on public cross-domain TC benchmarks are conducted and demonstrate that EADA not only alleviates the domain discrepancy but also enhances the feature discriminability in the target domain, which leads to compelling cross-domain TC performance without requiring any labeled data in the target domain.