Matching Distributions between Model and Data: Cross-domain Knowledge Distillation for Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a source domain to an unlabeled target domain. Existing methods typically learn to adapt the target model by exploiting the source data and sharing the network architecture across domains. However, this pipeline puts the source data at risk and is inflexible for deploying the target model. This paper tackles a novel setting where only a trained source model is available, and the target model may adopt a different network architecture to suit its deployment environment. We propose a generic framework named Cross-domain Knowledge Distillation (CdKD) that requires no source data. CdKD matches the joint distributions between a trained source model and a set of target data while distilling the knowledge from the source model to the target domain. For the first time, gradient information, an important type of knowledge in the source domain, is exploited to boost transfer performance. Experiments on cross-domain text classification demonstrate that CdKD achieves superior performance, which verifies its effectiveness in this novel setting.


Introduction
Annotating sufficient training data is usually expensive and time-consuming for diverse application domains. Unsupervised Domain Adaptation (UDA) aims to solve this learning problem in an unlabeled target domain by utilizing the abundant knowledge in an existing domain, called the source domain, even when the two domains have different distributions. This technique has motivated research on cross-domain text classification (Ye et al., 2020; Gururangan et al., 2020). One important type of knowledge in the source domain is the labels of samples. Current methods mainly leverage the labeled source data and unlabeled target data to learn domain-invariant features (Tzeng et al., 2014; Ganin and Lempitsky, 2015) and discriminative features (Saito et al., 2017; Ge et al., 2020) that are shared across different domains.
Unfortunately, access to the source data is sometimes forbidden: the data are distributed across different devices and usually contain private information, e.g., user profiles. Existing methods cannot yet solve the UDA problem without the source data. In addition, it is often necessary to adapt the target domain with a flexible network architecture, different from that of the source domain, to satisfy the deployment requirements of different domains. However, most existing works require the same network architecture to be shared between domains. In this paper, we propose a novel UDA setting: only a trained source model and a set of unlabeled target data are provided, and the target model is allowed to have a network architecture different from that of the trained source model. It differs from vanilla UDA in that a trained source model, instead of the source data, is provided as supervision for the unlabeled target domain when learning to adapt the model. Such a setting satisfies privacy policies, enables effective delivery, and helps deploy the target model flexibly according to the target application.
Our setting seems somewhat similar to Knowledge Distillation (KD) (Hinton et al., 2015), where a trained teacher model teaches a student model with a different architecture on the same task over a set of unlabeled data. KD assumes that the empirical distribution of the data used for training the student model matches the distribution associated with the trained teacher model. Nevertheless, in our setting, the unlabeled data and the teacher (source) model have different distributions. One simple yet generic solution for our setting is to match the distributions between the source and target domains during the process of distilling the knowledge. However, it is quite challenging to reduce the shift between a known distribution (e.g., a trained source model) and the empirical distribution of data (e.g., target data). Prior methods minimize a distance metric of domain discrepancy, such as Maximum Mean Discrepancy (MMD) (Tzeng et al., 2014), to match the distributions across domains in terms of the source and target data. Unfortunately, the empirical evaluation of these metrics is unavailable since we cannot access the source data.
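For concreteness, the empirical discrepancy that such prior methods minimize can be sketched as follows. This is an illustrative NumPy implementation of the standard biased MMD² estimator with a Gaussian kernel (function names are ours, not the paper's):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel values between rows of a and rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(xs, xt, bandwidth=1.0):
    # Biased estimator of squared MMD between source samples xs and target samples xt.
    kss = gaussian_kernel(xs, xs, bandwidth).mean()
    ktt = gaussian_kernel(xt, xt, bandwidth).mean()
    kst = gaussian_kernel(xs, xt, bandwidth).mean()
    return kss + ktt - 2 * kst
```

Note that the cross term `kst` requires the source samples `xs` themselves, which is exactly what the source-free setting forbids.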
In this paper, we propose a generic framework named Cross-domain Knowledge Distillation (CdKD). Specifically, we define a Joint Kernelized Stein Discrepancy (JKSD) that measures the largest discrepancy, over a Hilbert space of functions, between empirical sample expectations over the target domain and expectations under the source distribution. Inspired by the work of Liu et al. (2016), the source distribution expectations vanish under the effect of a Stein operator, so we can evaluate the discrepancy of joint distributions without any source data. We embed the JKSD criterion into a deep network where multi-view features of the source model, including activations, gradients and class probabilities, are exploited to explore domain-invariant and discriminative features across domains. In addition, we further maximize JKSD using an adversarial strategy in which the multi-view features are fully integrated into domain adaptation. Finally, CdKD is learned by jointly optimizing both the KD objective (Hinton et al., 2015) and JKSD. The main contributions are outlined as follows.
• We propose to investigate the problem of UDA without source data by exploring the distribution discrepancy between a source model and a set of target data. We adapt the target domain with a flexible network architecture chosen according to the deployment environment.
• For the first time, the gradient information of the source domain is exploited to boost UDA performance. Mu et al. (2020) show a key intuition that per-sample gradients contain task-relevant discriminative information.
• We experiment on two Amazon review datasets for cross-domain text classification, demonstrating that CdKD retains a clear performance advantage in all settings despite needing no source data.

Related Work

Unsupervised Domain Adaptation (UDA)
Discrepancy metrics such as MMD (Tzeng et al., 2014) and adversarial learning (Ganin and Lempitsky, 2015) are commonly used to learn domain-invariant features by aligning the marginal distributions. To learn discriminative features for UDA, self-training methods (Saito et al., 2017; Zou et al., 2019) train the target classifier using the pseudo labels of target data. These works commit to improving the quality of pseudo labels, including introducing mutual learning (Ge et al., 2020) and dual information maximization (Ye et al., 2020). The other line of learning discriminative features is to match the conditional distributions across domains by aligning multiple domain-specific layers (Long et al., 2017, 2018) or making an explicit hypothesis about the conditional distributions (Yu et al., 2019; Fang et al., 2020). STN (Yao et al., 2019) explores the class-conditional distributions to approximate the discrepancy between the conditional distributions via Soft-MMD. Zhang et al. (2021) derive a novel criterion, Conditional Mean Discrepancy (CMD), to measure the shift between conditional distributions directly in a tensor-product Hilbert space. However, these methods assume the target users can access the source data, which is unsafe and sometimes impractical since the source data may be private and decentralized. Therefore, recent works propose to generalize a target model over a set of unlabeled target data using only the supervision of a trained source model. SHOT learns a target-specific feature extraction module by using both information maximization and a self-training strategy. Other works improve the target model through target-style data based on generative adversarial networks (GANs), where the GAN and the target model collaborate without source data.
Unfortunately, these methods require the target model to share the same network architecture as the source model. Meanwhile, the multi-view features in the source model, including activations and gradients, are not exploited, although they also contribute substantially to domain adaptation.

Figure 1: The proposed CdKD framework for UDA without source data.

Knowledge Distillation (KD)
KD transfers the knowledge from a cumbersome model to a small model that is more suitable for deployment (Hinton et al., 2015). The general technique of KD uses a teacher-student strategy, where a large deep teacher model trained for a given task teaches a shallower student model on the same task (Yim et al., 2017). The teacher and student models are trained on the same data. These KD methods assume that the training data and the distribution associated with the teacher model are independent and identically distributed. However, sometimes we must train a student model in a new domain with which the teacher model is not familiar, i.e., domain shifts exist between the new domain and the domain on which the teacher model was trained. The proposed CdKD is able to relieve the domain shifts adaptively while distilling the knowledge.

Methodology
We address the unsupervised domain adaptation (UDA) task with only a trained source model and without access to the source data. We consider K-way classification. Formally, in this novel setting, we are given a trained source model f_s : X → Y and a set of unlabeled target data D_t drawn from the target domain. CdKD is a special case of KD which also consists of a trained teacher model f_s, a student model f_t and unlabeled data D_t. But it differs from KD in that the empirical distribution of D_t does not match the distribution associated with the trained model f_s. Therefore, it is necessary to introduce distribution adaptation to eliminate the biases between the source and target domains while distilling the knowledge. Specifically, as shown in Figure 1(a), we first introduce KD to distill the knowledge to the target domain in terms of the class probabilities produced by the source model f_s. Then, we introduce a novel criterion, JKSD, to match the joint distributions across domains by evaluating the shift between a known distribution and a set of data. This is the first work to explore the distribution discrepancy between a model and a set of data in the UDA task.

Distilling Knowledge to Target Domain
Given a target sample x ∈ D_t, the target model f_t : X → Y produces class probabilities by using a "softmax" output layer that converts the logits p = (p_1, ..., p_K) into a probability vector f_t(x) = (q_1, ..., q_K), where

q_i = exp(p_i / T) / Σ_j exp(p_j / T),

and T is a temperature used for generating "softer" class probabilities. We optimize the target model f_t by minimizing the following knowledge distillation objective, the cross-entropy between the soft class probabilities of the two models,

L_kd = − E_{x∈D_t} Σ_{k=1}^K f_s^k(x) log f_t^k(x).

In our paper, the setting of the temperature follows Hinton et al. (2015): a high temperature T is adopted to compute f_t(x) during training, but after training a temperature of 1 is used.
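A minimal sketch of this temperature-scaled distillation objective (our illustrative NumPy code; following Hinton et al. (2015), the teacher's soft targets are also computed at temperature T):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; subtracting the max is for numerical stability.
    z = logits / T
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=10.0):
    # Cross-entropy between softened teacher targets (f_s) and student
    # predictions (f_t), averaged over the batch.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -(p * np.log(q + 1e-12)).sum(-1).mean()
```

As expected, the loss is minimized (down to the entropy of the soft targets) when the student's logits match the teacher's.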

Joint Kernelized Stein Discrepancy
In the traditional UDA setting, Joint Maximum Mean Discrepancy (JMMD) (Long et al., 2017) has been applied to measure the discrepancy between the joint distributions of different domains, and it can be estimated empirically using finite samples from the source and target domains. Specifically, suppose k : X × X → R and l : Y × Y → R are positive definite kernels with feature maps φ(·) : X → F and ψ(·) : Y → G for the domains X and Y, corresponding to reproducing kernel Hilbert spaces (RKHS) F and G respectively. Let C_XY^P : G → F be the uncentered cross-covariance operator, defined as C_XY^P = E_{(x,y)∼P}[φ(x) ⊗ ψ(y)]. In our setting, unfortunately, the empirical estimation of JMMD is unavailable since we cannot access the source data D_s directly (the empirical estimation of JMMD is given in Appendix A.1). Kernelized Stein discrepancy (KSD), a statistical test for goodness-of-fit, can test whether a set of samples is generated from a marginal probability distribution (Chwialkowski et al., 2016; Liu et al., 2016). Inspired by KSD, we introduce Joint KSD (JKSD) to evaluate the discrepancy between a known distribution P(X, Y) and a set of data Q̂ ∼ Q(X, Y). We begin by defining a Stein operator A_P acting on f ⊗ g ∈ F^d ⊗ G,

(A_P f ⊗ g)(x, y) = (∇_x log P(x, y)^T f(x) + 1_d^T ∂_x f(x)) g(y),

where ∂_x f(x) = (∂f_1(x)/∂x_1, ..., ∂f_d(x)/∂x_d)^T and 1_d is a d × 1 vector with all elements equal to 1. The expectation of the Stein operator A_P under the distribution P is equal to 0, which can be proved easily by (Chwialkowski et al., 2016, Lemma 5.1). The Stein operator A_P can be expressed by defining a function ξ_xy over the space F^d ⊗ G that depends on the gradients of the log-distribution and the kernel,

ξ_xy = ∇_x log P(x, y) k(x, ·) ⊗ l(y, ·) + ∇_x k(x, ·) ⊗ l(y, ·).

Thus, (A_P f ⊗ g)(x, y) can be presented as an inner product, i.e., ⟨f ⊗ g, ξ_xy⟩_{F^d ⊗ G}. Now, we can define JKSD and express it in the RKHS by replacing the term f(x)g(y) in J(P, Q) with our Stein operator,

S(P, Q) = sup_{f⊗g∈H} E_Q[(A_P f ⊗ g)(x, y)] = ||E_Q ξ_xy||_{F^d ⊗ G},   (3)

where H is the unit ball in F^d ⊗ G. This makes it clear why Eq. 3 has a desirable property: we can compute S(P, Q) by computing the Hilbert-Schmidt norm of E_Q ξ_xy, without needing to access data drawn from P.
We can empirically estimate S²(P, Q) based on the known probability P and finite samples Q̂ = {(x_i, y_i)}_{i=1}^m ∼ Q(X, Y) using the kernel trick as follows,

Ŝ²(P, Q) = (1/m²) Σ_{i=1}^m Σ_{j=1}^m β_{i,j} l(y_i, y_j),   (5)

where β_{i,j} = (∇²K + 2Υ + Ω)_{i,j} with

(Ω)_{i,j} = ∇_{x_i} log P(x_i, y_i)^T ∇_{x_j} log P(x_j, y_j) k(x_i, x_j),
(Υ)_{i,j} = ½ (∇_{x_i} log P(x_i, y_i)^T ∇_{x_j} k(x_i, x_j) + ∇_{x_j} log P(x_j, y_j)^T ∇_{x_i} k(x_i, x_j)),
(∇²K)_{i,j} = Σ_{u=1}^d ∂²k(x_i, x_j) / (∂x_{i,u} ∂x_{j,u}).

In our experiments, we adopt Gaussian kernels for k and l.

Remark. By virtue of goodness-of-fit test theory, S(P, Q) = 0 if and only if P = Q (Chwialkowski et al., 2016). Instead of applying uniform weights as MMD does, JKSD applies non-uniform weights β_{i,j}, where β_{i,j} = (∇²K + 2Υ + Ω)_{i,j} is, in turn, determined by the activation-based and gradient-based features of the known probability P. JKSD computes a dynamic weight β_{i,j} to decide whether sample i shares the same label as another sample j in the target domain. Different from cluster-based methods, JKSD assigns each sample a label according to all the data in the target domain instead of the centroid of each category. The computation of centroids suffers severely from noise due to the domain shifts. In contrast, our solution is more suitable for UDA because we avoid using untrusted intermediate results (i.e., the centroid of each category) to infer the labels.
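To make the Stein mechanics concrete, the following sketch computes the ordinary (marginal) KSD V-statistic against a standard normal target, whose score ∇_x log p(x) = −x is known in closed form. It is a simplified analogue of Eq. 5 without the label kernel l, with all names and the toy target chosen by us for illustration:

```python
import numpy as np

def ksd2_vs_standard_normal(x, h=1.0):
    # V-statistic estimate of squared KSD between samples x (m x d) and N(0, I).
    # Stein kernel: u(x, x') = s(x)^T s(x') k + s(x)^T grad_{x'} k
    #               + s(x')^T grad_x k + tr(grad_x grad_{x'} k),
    # with score s(x) = -x for the standard normal target.
    m, d = x.shape
    diff = x[:, None, :] - x[None, :, :]          # x_i - x_j
    d2 = (diff ** 2).sum(-1)
    K = np.exp(-d2 / (2 * h * h))                 # Gaussian kernel matrix
    s = -x                                        # score of the target
    term1 = (s @ s.T) * K
    # grad_{x_j} k = K (x_i - x_j) / h^2 ; grad_{x_i} k = -K (x_i - x_j) / h^2
    term2 = (s[:, None, :] * diff).sum(-1) / (h * h) * K
    term3 = -(s[None, :, :] * diff).sum(-1) / (h * h) * K
    term4 = K * (d / (h * h) - d2 / h**4)
    return (term1 + term2 + term3 + term4).mean()
```

Samples concentrated near the target's mode give a small value, while shifted samples give a large one, mirroring how Eq. 5 detects distribution mismatch without samples from P.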

Training
The pipeline of our CdKD framework is shown in Figure 1(b). The source model, parameterized by a DNN, consists of two modules: a feature extractor T_s : X → Z_s and a classifier G_s : Z_s → Y, i.e., f_s(x) = G_s(T_s(x)). The target model f_t = G_t ∘ T_t also has two modules, for which we use the parallel notations T_t(·; θ_T) : X → Z_t and G_t(·; θ_G) : Z_t → Y. Note that in our experiments, the dimension of the latent representations of the source model is set equal to that of the target model, i.e., Z_s = Z_t = R^d. The extractors T_s and T_t are allowed to adopt different network architectures. The input space X is usually highly sparse, where the kernel function cannot capture sufficient features to measure similarity. Therefore, we evaluate JKSD based on the latent representations of target samples, i.e., Q̂ = {(z, y) | z = T_t(x), y = G_t(z), x ∈ D_t} ∼ Q(Z, Y). In Eq. 5, we need to evaluate the joint probability P(Y = y, Z = z) = p(y|z) p(z) for a sample (z, y) obtained from Q̂. The probability p(y|z) that the sample follows the conditional distribution P(Y|Z) of the source domain can be evaluated as p(y|z) = y^T G_s(z). Similarly, the term p(z) represents the probability that the target representation z follows the marginal distribution P(Z) of the source domain. Since we cannot access the source marginal distribution directly, we approximate it by evaluating the cosine similarity between the representations output by the source model and the target model, i.e., p(z) = ½ cos(z, T_s(x)) + ½, where x = T_t^{-1}(z) is the sample corresponding to z for any z ∈ Q̂. Formally, the term ∇_z log P(z, y) in Eq. 5 can be computed as

∇_z log P(z, y) = (∇_z G_s(z))^T y / (y^T G_s(z)) + ∇_z p(z) / p(z),

where ∇_z G_s(z) ∈ R^{K×d} is the Jacobian matrix of the source classifier G_s with respect to the latent representation z. We propose to train the target model f_t by jointly distilling the knowledge from the source domain and reducing the shifts in the joint distributions via JKSD,

min_{θ_T, θ_G} L_kd + μ S²(P, Q),

where μ > 0 is a tradeoff parameter for JKSD.
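The classifier term of this score can be checked numerically. Below is an illustrative NumPy sketch of ∇_z log p(y|z) = (∇_z G_s(z))^T y / (y^T G_s(z)), using our own toy linear-softmax stand-in for G_s (the parameters W, b are assumptions for the sketch, not the paper's model):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def grad_log_py_given_z(W, b, z, y):
    # Score of a toy source classifier G_s(z) = softmax(W z + b) w.r.t. z,
    # for the class picked out by one-hot vector y:
    #   grad_z log p(y|z) = J^T y / (y^T G_s(z)),
    # where J = grad_z G_s(z) is the K x d Jacobian of the softmax output.
    p = softmax(W @ z + b)                    # G_s(z), shape (K,)
    J = (np.diag(p) - np.outer(p, p)) @ W     # softmax Jacobian via chain rule
    return (J.T @ y) / (y @ p)
```

For a one-hot y = e_c this simplifies to the familiar W^T (e_c − p), which a finite-difference check confirms.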
In order to maximize the test power of JKSD, we require the class of functions h ∈ F^d ⊗ G to be rich enough. Meanwhile, kernel-based metrics usually suffer from vanishing gradients for low-bandwidth kernels. We are inspired by (Long et al., 2017), which introduces adversarial training to circumvent these issues. Specifically, we apply fully connected layers U and V, parameterized by θ_U and θ_V, to JKSD, i.e., k(x_i, x_j) and l(y_i, y_j) are replaced by k(U(x_i), U(x_j)) and l(V(y_i), V(y_j)) in Eq. 5. We maximize JKSD with respect to the new parameters θ_U and θ_V to maximize its test power, such that the samples in the target domain are made more discriminative by fully exploiting the activation and gradient features of the source domain. As shown in Figure 1(c), the target model f_t can be optimized by the following adversarial objective,

min_{θ_T, θ_G} max_{θ_U, θ_V} L_kd + μ S²(P, Q).

Experiments

Setup
To test its versatility, we evaluate the proposed model on two tasks: UDA and knowledge distillation. Amazon-Review 1 is a benchmark dataset for domain adaptation in text classification. Two versions of the Amazon Review dataset are used to evaluate models. A simplified Amazon-Review dataset (Amazon-Feature) was collected from four distinct domains: Books (B), DVD (D), Electronics (E) and Kitchen (K). Each domain comprises 4,000 samples with 400-d feature representations and 2 categories (positive and negative). Zhang et al. (2021) collected a larger dataset, called Amazon-Text, from Amazon-Review with the same domains as Amazon-Feature to test model performance in large-scale transfer learning. The review texts are divided into two categories according to user rating, i.e., positive (5 stars) and negative (1 star). There are 10,000 original review texts in each category and 20,000 texts in each domain. The notation S→T represents transfer learning from the source domain S to the target domain T.
Baselines. For the bulk of the experiments, the following baselines are evaluated. The Source-Only model is trained only on source-domain data and tested on target-domain data, while the Train-on-Target model is trained and tested directly on target-domain data. We compare with conventional domain adaptation methods: Transfer Component Analysis (TCA) (Pan et al., 2010), Balanced Distribution Adaptation (BDA), Geodesic Flow Kernel (GFK) (Gong et al., 2012), Deep Domain Confusion (DDC) (Tzeng et al., 2014), Domain Adversarial Neural Networks (RevGrad) (Ganin and Lempitsky, 2015) and Dynamic Adversarial Adaptation Network (DAAN). We compare with SHOT for the UDA task without source data. We also compare with the knowledge distillation method (KD) (Hinton et al., 2015) in our setting.

1 http://jmcauley.ucsd.edu/data/amazon/
In our experiments, three different extractors are used. For the Amazon-Feature dataset, the extractor is simply modeled as a typical 3-layer fully connected network (MLP) that transforms 400-d inputs into 50-d latent feature vectors. Two types of networks are leveraged for the Amazon-Text dataset to encode the original review texts, i.e., TextCNN and BertGRU. TextCNN (Kim, 2014) is a text convolutional network that consists of 150 convolutional filters with 3 different window sizes. We also evaluate the performance of cross-domain text classification on a pre-trained language model, i.e., BERT (Devlin et al., 2019). We freeze the BERT model and construct a 2-layer bi-directional GRU (Cho et al., 2014) to learn from the representations produced by BERT. The classifier is modeled as a 2-layer fully connected network in all settings. For CdKD, we learn the source model f_s by minimizing the standard cross-entropy loss. We randomly specify a 0.7/0.3 split of the source dataset and generate the optimal source model based on the validation split. U and V are modeled as weight matrices.
We implement all deep methods in the PyTorch framework, and the BERT model is implemented and pre-trained with pytorch-transformers. We adopt a Gaussian kernel with bandwidth set to the median of the pairwise squared distances on the training data (Gretton et al., 2012). The temperature T is set to 10 during training. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a batch size of 128 and the learning rate annealing strategy of (Long et al., 2017): the learning rate is adjusted during back propagation as η_p = η_0 / (1 + 10p)^0.75, where p is the training progress changing linearly from 0 to 1 and η_0 is set to 0.001. We apply the same strategy as (Ganin and Lempitsky, 2015) to adjust the factor μ dynamically, i.e., we gradually change it from 0 to 1 by the progressive schedule μ_p = 2 / (1 + exp(−10p)) − 1.
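The two schedules can be written down directly. The learning rate formula below is the standard annealing of Ganin and Lempitsky (2015) as adopted by Long et al. (2017), which we assume is the formula elided from the extracted text:

```python
import math

def lr_schedule(p, eta0=0.001):
    # eta_p = eta0 / (1 + 10 p)^0.75, where p in [0, 1] is training progress.
    return eta0 / (1.0 + 10.0 * p) ** 0.75

def mu_schedule(p):
    # mu_p = 2 / (1 + exp(-10 p)) - 1, rising smoothly from 0 toward 1.
    return 2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0
```

The μ schedule keeps the JKSD term switched off early in training, when the target features (and hence the estimated discrepancy) are still noisy.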

Results
In the first experiment, we compare with the conventional domain adaptation methods where the source model and target model share the same network architectures. The classification accuracy results on the Amazon-Feature dataset for domain adaptation based on MLP are shown in Table 1. Some of the observations and analysis are listed as follows.
(1) The performance of traditional UDA methods (e.g., TCA, GFK and BDA) is worse than the Source-Only model, i.e., negative transfer occurs in all transfer tasks. These models directly define kernels over sparse input vectors, so the kernel function cannot capture sufficient features to measure similarity. The deep transfer methods outperform all the traditional methods, suggesting that embedding domain adaptation modules into deep networks can reduce the domain discrepancy significantly.
(2) The average accuracy of CdKD is 1.0% higher than the other deep transfer methods (DDC, RevGrad, DAAN and SHOT) overall. This verifies the positive effect of transferring knowledge from a trained source model without accessing the source data. Table 2 shows the classification performance of deep UDA models based on TextCNN and BertGRU on the larger Amazon-Text dataset. For the TextCNN extractor, we have the following analysis. CdKD achieves superior performance over prior methods by larger margins than on the small Amazon-Feature dataset. Compared to DDC and RevGrad, which obtain domain-invariant features, CdKD can learn discriminative information from the source model by minimizing the JKSD criterion. SHOT assumes that the target outputs should be similar to a one-hot encoding. However, the one-hot encoding used in SHOT is noisy and untrusted due to the domain shifts. Different from SHOT, we match the joint distributions across domains in terms of multi-view features, rather than only class probabilities, when adapting the target model. By going from TextCNN to the extremely deep BertGRU, we attain a more in-depth understanding of feature transferability. BertGRU-based models outperform TextCNN-based models significantly, which shows that BERT enables learning more transferable representations for UDA. Our CdKD retains a slight advantage over other models overall under the powerful transferability of BertGRU. This reveals the necessity of designing a moment matching approach that incorporates activation and gradient features into domain adaptation to reduce the losses caused by the lack of source data.
In the second experiment, we compare with the KD model, where the knowledge in BertGRU is distilled into the TextCNN-based model. We generate the optimal BertGRU as the teacher model based on the source dataset. The TextCNN model uses the BERT tokenizer to guarantee the same input space between the two models. We randomly specify a 0.5/0.2/0.3 split of the target dataset, where we train and select the TextCNN-based model on the train split and validation split respectively. The results on the test split are reported in Table 3. The average accuracy of CdKD is 1.6% higher than the original KD and approaches that of the teacher model BertGRU. Notably, the accuracy scores on tasks D → E and D → K are higher than BertGRU's. This is attributed to distribution adaptation, where extra performance is gained from JKSD besides the guidance of the teacher model.

Analysis
Ablation Study. We conduct ablation experiments to see the contributions of the gradient information (g) and the adversarial strategy (a), evaluated with the TextCNN extractor on the UDA task. By ablating CdKD, we obtain two baselines, CdKD-g (w/o g) and CdKD-a (w/o a). For CdKD-g, we set the gradient of the log-distribution ∇_{x_j} log P(x_j, y_j) ∈ R^{d×1} to a constant vector; for CdKD-a, we remove the adversarial layers U and V. Both baselines perform worse than CdKD but still better than KD, suggesting that the gradient information and the adversarial strategy both contribute to the improvements of our model. The gradient information is one type of important knowledge in the source domain, but all previous methods ignore its importance for UDA.

Effects of Source Model Accuracy. Here we study how the performance of the target model is influenced by the source model accuracy, analyzed on the B → E task using the TextCNN extractor. We randomly obtain 9 optimal source models using different seeds on the B dataset, and train CdKD and KD models based on the different source models for the B → E task. Figure 3 shows the classification accuracy of CdKD and KD for source models of varying accuracy tested on the E dataset. CdKD obtains similar performance under different source models, indicating that CdKD is not very sensitive to the quality of the source models. However, the curves of KD are unstable, i.e., the performance of KD is vulnerable to the impact of the source models, because different source models follow different distributions. Evidently, JKSD plays a crucial role in alleviating this distribution discrepancy among different source models.

Effects of Batch Size. Batch size is a key parameter in optimizing the JKSD metric because the kernel must be computed over a mini-batch of data. Figure 4 shows the classification accuracy of CdKD with batch size varying in {64, 128, 256, 512}.
The experiment shows that CdKD is not sensitive to batch size when the batch size is larger than 64, suggesting that CdKD does not need a very large batch size for an accurate estimation of JKSD.

Conclusion
In this paper, we shed new light on the challenge of UDA without source data. Specifically, we provided a generic framework named CdKD to learn a classification model over a set of unlabeled target data by making use of the activation and gradient information in the trained source model. CdKD learns the collective knowledge across different domains, including domain-invariant and discriminative features, by matching the joint distributions between a trained source model and a set of target data. Experiments on cross-domain text classification verified that CdKD achieves advantages on the UDA task despite using no source data, and improves the performance of KD when the trained teacher model does not match the training data.

A Appendices
A.1 Empirical Evaluation of JMMD

JMMD J(P, Q) measures the shift between the joint distributions P(X, Y) and Q(X, Y) by

J(P, Q) = sup_{f⊗g∈H} E_Q[f(x)g(y)] − E_P[f(x)g(y)].

Given source samples D_s = {(x_i^s, y_i^s)}_{i=1}^n ∼ P(X, Y) and target samples D_t = {(x_i^t, y_i^t)}_{i=1}^m ∼ Q(X, Y), the empirical estimation of JMMD is

Ĵ²(P, Q) = (1/n²) tr(K_ss L_ss) + (1/m²) tr(K_tt L_tt) − (2/mn) tr(K_st L_ts),   (7)

where (K_st)_{i,j} = k(x_i^s, x_j^t) and (L_st)_{i,j} = l(y_i^s, y_j^t) are Gram matrices, and tr(A) is the trace of matrix A. Eq. 7 requires the source-dependent Gram matrices K_st, K_ss, L_st and L_ss to compute the JMMD score, so it obviously cannot be adapted to our new setting. Note that the JMMD used in our paper is a simplified version of (Long et al., 2017), where we only consider two variables.
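A sketch of this two-variable empirical JMMD (illustrative NumPy, with Gaussian kernels for both k and l; the function names are ours):

```python
import numpy as np

def rbf(a, b, h=1.0):
    # Gaussian (RBF) Gram matrix between rows of a and rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h))

def jmmd2(xs, ys, xt, yt, h=1.0):
    # J^2 = (1/n^2) tr(Kss Lss) + (1/m^2) tr(Ktt Ltt) - (2/mn) tr(Kst Lts)
    n, m = len(xs), len(xt)
    kss, lss = rbf(xs, xs, h), rbf(ys, ys, h)
    ktt, ltt = rbf(xt, xt, h), rbf(yt, yt, h)
    kst = rbf(xs, xt, h)          # n x m: k(x^s_i, x^t_j)
    lts = rbf(yt, ys, h)          # m x n: l(y^t_i, y^s_j)
    return (np.trace(kss @ lss) / n**2
            + np.trace(ktt @ ltt) / m**2
            - 2.0 * np.trace(kst @ lts) / (n * m))
```

Every term except the purely target-side one involves source samples, which is why this estimator is unusable in the source-free setting.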
A.2 Empirical Evaluation of JKSD

For f = (f_1, ..., f_d) ∈ F^d and g = (g_1, ..., g_d) ∈ F^d, the inner product between f and g is defined as ⟨f, g⟩_{F^d} = Σ_{i=1}^d ⟨f_i, g_i⟩_F. Based on this definition, the inner product between kernel derivatives satisfies

⟨∇_x k(x, ·), ∇_{x'} k(x', ·)⟩_{F^d} = Σ_{i=1}^d ∂²k(x, x') / (∂x_i ∂x'_i).

Similar to (Chwialkowski et al., 2016), writing λ_xy = ∇_x log P(x, y) k(x, ·) + ∇_x k(x, ·), we can compute h(x, y, x', y') = ⟨λ_xy, λ_{x'y'}⟩_{F^d} as

h(x, y, x', y') = ∇_x log P(x, y)^T ∇_{x'} log P(x', y') k(x, x')
 + ∇_x log P(x, y)^T ∇_{x'} k(x, x')
 + ∇_{x'} log P(x', y')^T ∇_x k(x, x')
 + Σ_{i=1}^d ∂²k(x, x') / (∂x_i ∂x'_i).

Thus, JKSD S²(P, Q) is the expectation of h(x, y, x', y') l(y, y') over the distribution Q,

S²(P, Q) = E_Q E_Q [h(x, y, x', y') l(y, y')].

Given a set of samples D_t = {(x_i, y_i)}_{i=1}^m ∼ Q(X, Y), we can evaluate S²(P, Q) as

(1/m²) Σ_{(x,y)} Σ_{(x',y')} h(x, y, x', y') l(y, y'),

which can be represented in the matrix form shown in Eq. 5.