Wasserstein Selective Transfer Learning for Cross-domain Text Mining

Transfer learning (TL) seeks to improve learning in a data-scarce target domain by using information from source domains. However, the source and target domains usually have different data distributions, which may lead to negative transfer. To alleviate this issue, we propose a Wasserstein Selective Transfer Learning (WSTL) method. Specifically, the proposed method employs a reinforced selector to select helpful data for transfer learning. We further use a Wasserstein-based discriminator to maximize the empirical distance between the selected source data and the target data. The TL module is then trained to minimize the estimated Wasserstein distance in an adversarial manner and provides domain-invariant features for the reinforced selector. We adopt an evaluation metric based on the performance of the TL module as the delayed reward and a Wasserstein-based metric as the immediate reward to guide the reinforced selector's learning. Compared with competing TL approaches, the proposed method selects data samples that are closer to the target domain. It also provides better state features and reward signals that lead to better performance with faster convergence. Extensive experiments on three real-world text mining tasks demonstrate the effectiveness of the proposed method.


Introduction
Transfer learning (TL) is a classical machine learning paradigm that leverages information from data-rich source domains to help a data-scarce target domain (Pan and Yang, 2009). Recently, transfer learning based on deep neural networks, referred to as deep transfer learning (Ruder and Plank, 2017; Yosinski et al., 2014), has been widely used in various tasks in natural language processing (Peng et al., 2018; Mou et al., 2016) and computer vision (Loey et al., 2021; Ganin et al., 2016; Long et al., 2017). Despite its success in various applications, vanilla deep transfer learning approaches may suffer from the negative transfer problem (Chen et al., 2011; Ruder and Plank, 2017; Huang et al., 2007; Qu et al., 2019a; Wang et al., 2019), as the source and target domains usually have different data distributions. One solution to this problem is instance-based transfer learning, which carefully selects instances from the source domain to alleviate or avoid negative transfer. Recent instance-based transfer learning methods incorporate Reinforcement Learning (RL) for data selection (Qu et al., 2019a; Wang et al., 2019). These methods jointly train an RL-based data selector and the transfer learning module, which has been demonstrated to outperform previous methods.

However, the aforementioned methods suffer from two challenges. First, the transfer learning module used in previous reinforced instance-based TL methods is a simple fully-shared model, which may not be able to learn clean domain-invariant feature representations that are also discriminative for prediction (Liu et al., 2017; Shen et al., 2017). This leads to sub-optimal transfer results, and the transfer learning module cannot in turn provide good state representations for the RL module.
Second, the environment built by the transfer learning module provides sparse, delayed rewards with high variance during training, which makes the RL policy difficult to optimize and fails to provide reasonable signals to guide the data selection process. The RL policy acts at the instance level, but the delayed reward arrives at the batch level. Thus, this is a challenging sparse reward problem, i.e., updating batched sequential decisions with a single delayed reward.
In light of these challenges, we propose a Wasserstein distance-based Selective Transfer Learning (WSTL) method to select helpful data from the source domain to help the target. Our method is built on top of the reinforced transfer learning framework but, differently, introduces a Wasserstein distance-based discriminator. The advantages of the discriminator are two-fold. First, the discriminator is trained to distinguish features between the source and target, and further guides the transfer learning module to minimize the estimated Wasserstein distance in an adversarial manner. Hence, the features learned by the transfer learning module are domain-invariant (i.e., more transferable) and also discriminative for prediction (i.e., yielding better transfer learning results). Second, the discriminator provides a Wasserstein distance-based metric that serves as an immediate reward signal to guide the RL policy, which alleviates the sparse reward problem. In this way, the proposed method can select high-quality source data to help the target domain efficiently and effectively.
To demonstrate the benefits of the proposed WSTL method, we evaluate it on three real-world text mining tasks: paraphrase identification, natural language inference, and review text analysis. Experimental results on all these datasets show that the proposed method outperforms the state-of-the-art methods by a large margin. Empirical studies also confirm that the proposed method selects source domain data that are close to the target domain, which indeed helps to reduce domain discrepancy and alleviate negative transfer.
We summarize the contributions as follows.
1) We propose a Wasserstein distance-based reinforced selective transfer learning method that selects high-quality data efficiently and effectively to alleviate negative transfer.
2) The introduced Wasserstein discriminator provides better state representations and immediate reward signals to the reinforced selector, which leads to more stable training and better performance with faster convergence.
3) Experiments on three real-world text mining tasks demonstrate the proposed method significantly outperforms the state-of-the-art methods. The empirical studies also confirm that the data instances selected by the proposed method are closer to the target domain.

Proposed Method
We formulate the problem in a standard transfer learning setting. Given a source domain $D^e = \{(x^e_i, y^e_i)\}_{i=1}^{n_e}$ and a target domain $D^t = \{(x^t_i, y^t_i)\}_{i=1}^{n_t}$, our model aims to improve the performance on $D^t$ using the rich knowledge in $D^e$, which is usually much larger than $D^t$.

Model Architecture
As shown in Figure 1, our model consists of three components: a transfer learning model $f_\omega$, a reinforced data selector $f_\theta$, and a Wasserstein distance-based discriminator $f_\phi$. The discriminator estimates the empirical Wasserstein distance between the source and target domains. The TL module is then trained to minimize the estimated Wasserstein distance adversarially. The discriminator and the TL module give immediate and delayed feedback, respectively, to guide the reinforced data selector. Details are as follows.

Wasserstein-based Discriminator
The goal of the discriminator $f_\phi$ is to distinguish source from target data. We use the Wasserstein distance (Panaretos and Zemel, 2019) as the domain discrepancy measure, as it provides stable gradients even when the two distributions are distant. Based on (Shen et al., 2017; Villani, 2008), the Wasserstein distance between the source and target probability measures $\mathbb{P}^e, \mathbb{P}^t$ can be approximated as

$$W(\mathbb{P}^e, \mathbb{P}^t) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim \mathbb{P}^e}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}^t}[f(x)], \quad (1)$$

where $\|f\|_L$ denotes the Lipschitz constant.¹ Let $n_e$ and $n_t$ denote the number of source and target samples, respectively. The empirical Wasserstein distance can be approximated by maximizing the discriminator loss $\mathcal{L}_{wd}$ when $f_\phi$ is 1-Lipschitz:

$$\mathcal{L}_{wd} = \frac{1}{n_e} \sum_{i=1}^{n_e} f_\phi(h^e_i) - \frac{1}{n_t} \sum_{j=1}^{n_t} f_\phi(h^t_j), \quad (2)$$

where $h^e_i$ and $h^t_j$ denote the representations of source and target samples produced by the shared encoder. We enforce the Lipschitz constraint on $\phi$ via a gradient penalty, as suggested in Gulrajani et al. (2017). The objective of the Wasserstein-based discriminator is therefore

$$\max_{\phi} \Big\{ \mathcal{L}_{wd} - \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_v} \big[ (\|\nabla_{\hat{x}} f_\phi(\hat{x})\|_2 - 1)^2 \big] \Big\}, \quad (3)$$

where $\mathbb{P}_v$ is sampled uniformly along straight lines between source and target representation pairs and $\lambda$ is the penalty coefficient.

Figure 1: The Wasserstein discriminator estimates the empirical Wasserstein distance between the source and target domains. The TL module is trained adversarially against the discriminator to better learn domain-invariant features. A Wasserstein distance-based metric provides the immediate reward and, coupled with the delayed reward from the evaluation environment of the TL module, guides the RL module. Meanwhile, the RL module conducts data selection based on the state representations of the TL module.
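To make the critic loss and the gradient-penalty interpolates concrete, here is a minimal NumPy sketch. It is an illustration, not the paper's implementation: the shared encoder is elided (we sample toy representations directly) and the critic is a stand-in linear map, chosen so that its input gradient, and hence the penalty term, is easy to inspect.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(h, w):
    # Stand-in 1-D critic f_phi: a linear map, for illustration only.
    return h @ w

# Toy encoded representations for a source batch and a target batch.
h_src = rng.normal(1.0, 1.0, size=(8, 4))   # n_e = 8 samples, 4-dim features
h_tgt = rng.normal(0.0, 1.0, size=(6, 4))   # n_t = 6 samples
w = rng.normal(size=4)                       # critic parameters

# Empirical Wasserstein critic loss: mean critic score on source
# minus mean critic score on target.
L_wd = critic(h_src, w).mean() - critic(h_tgt, w).mean()

# Interpolates for the gradient penalty: points sampled uniformly on
# straight lines between randomly paired source and target features.
k = min(len(h_src), len(h_tgt))
eps = rng.uniform(size=(k, 1))
h_v = eps * h_src[:k] + (1.0 - eps) * h_tgt[:k]

# For a linear critic the gradient w.r.t. its input is simply w, so the
# penalty (||grad f_phi||_2 - 1)^2 is constant across interpolates.
grad_norm = np.linalg.norm(w)
penalty = np.mean((np.full(k, grad_norm) - 1.0) ** 2)

lam = 10.0  # penalty coefficient, as in Gulrajani et al. (2017)
objective = L_wd - lam * penalty  # maximized w.r.t. the critic
```

In a real implementation the gradient of the critic at each interpolate would be obtained by automatic differentiation; the linear critic here only makes the penalty term inspectable by hand.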

TL with Wasserstein Discriminator
To make the features learned in the TL module more domain-invariant and suitable for transfer, the TL module and the discriminator are trained adversarially: the discriminator tries to distinguish source from target data, while the TL model aims to fool the discriminator by minimizing the distance between source and target data. We adopt the Wasserstein distance as the adversarial loss because the Jensen-Shannon divergence used in previous adversarial methods (Ganin et al., 2016; Qu et al., 2019b) suffers from discontinuities, providing less useful gradients for training. In contrast, the Wasserstein distance is continuous and differentiable almost everywhere. The superiority of the Wasserstein distance for training is also verified in our experiments.
Specifically, we first train the discriminator to optimality via stochastic gradient ascent. We then fix the optimal parameters of the discriminator and update the TL module. The final loss of domain adversarial learning can be formulated as

$$\min_{\omega} \Big\{ \sum_{k \in \{e,t\}} \mathcal{L}^k(y, f_\omega(x)) + \beta \max_{\phi} \big[ \mathcal{L}_{wd} - \lambda \mathcal{L}_{grad} \big] \Big\}, \quad (4)$$

where $\mathcal{L}^k(y, f_\omega(x))$ ($k \in \{e, t\}$) denotes the loss of the source and target classifiers and $\mathcal{L}_{grad}$ denotes the gradient penalty term in Eq. (3). $\lambda$ is set to 0 when optimizing the minimum operation, since the gradient penalty should not guide the TL learning process. $\beta$ is the coefficient that controls the balance between domain-invariant learning and discriminative feature learning.
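The alternating scheme, several critic ascent steps followed by one TL update against the fixed critic, can be illustrated on a one-dimensional toy problem. This is a hedged sketch, not the paper's model: the "encoder" is a single shift parameter, the critic is linear, and weight clipping stands in for the gradient penalty as the Lipschitz mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
src = rng.normal(3.0, 1.0, size=256)   # source samples
tgt = rng.normal(0.0, 1.0, size=256)   # target samples

mu = 0.0   # TL "encoder" parameter: shifts source features by -mu
w = 0.1    # 1-D critic slope; clipped to [-1, 1] to enforce 1-Lipschitz
           # (a weight-clipping stand-in for the gradient penalty)
lr = 0.05

for step in range(200):
    enc_src = src - mu
    # Critic ascent on L_wd = mean f(src) - mean f(tgt), with f(x) = w * x.
    for _ in range(5):
        grad_w = enc_src.mean() - tgt.mean()
        w = np.clip(w + lr * grad_w, -1.0, 1.0)
    # TL descent against the fixed critic: d L_wd / d mu = -w.
    mu += lr * w

# After adversarial training the shifted source mean approaches the
# target mean, i.e. the estimated distance has been driven down.
```

The source distribution starts about three standard deviations away from the target; after training, the gap between the shifted source mean and the target mean shrinks well below the initial separation, which is the adversarial minimization in miniature.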

Reinforced Selective Training
The reinforced source data selector serves as an agent, and the selection process can be modeled as a Markov decision process solved by reinforcement learning: the selector selects a subset of source data, feeds it into the TL module together with the target data, and receives rewards for this action. The selection policy $\pi_\theta$ is learned by interacting with the TL environment. Specifically, let $b$ denote the batch index and $n$ the batch size. We first obtain the state representation $S^e_b = \{s_1, s_2, \ldots, s_n\}$ for the source samples, based on the semantic features generated by the shared encoder and the prediction results from the target classifier. The action $A^e_b = \{a_1, a_2, \ldots, a_n\}$, where $a_i \in \{0, 1\}$ means to drop or keep the $i$-th instance, is decided by the policy $\pi_\theta$:

$$\pi_\theta(a_i \mid s_i) = \mathrm{softmax}\big( W_2 \, g(W_1 s_i + b_1) + b_2 \big), \quad (5)$$

where $g$ is the ReLU activation, and $W_k$ and $b_k$ are the weight matrix and bias of the $k$-th layer.
Reward Design. The objective of the data selector is to maximize the expected total reward $J(\theta)$:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\big[ R_{tot} \big], \quad (6)$$

where $\theta$ denotes the parameters of the data selector. The reward signal $R_{tot}$ consists of both immediate and delayed rewards, defined as follows.
Immediate Reward. Our immediate reward is based on the Wasserstein metric and encourages the data selector to select data instances close to the target domain. Let $\mathcal{L}_{wd}(x^e_i, x^{t*})$ be the distance metric; it can be resolved to the following special case of the first Wasserstein distance:

$$\mathcal{L}_{wd}(x^e_i, x^{t*}) = \min_{T \ge 0} \sum_{j} T_{ij} \, d(i, j) \quad \text{s.t.} \quad \sum_{j} T_{ij} = a_i, \quad (7)$$

where $d(i, j)$ denotes the Euclidean distance between $x_i$ and $x_j$, the outputs of the shared encoder in the source and target domains respectively. $T_{ij}$ measures the "travelling cost" from sample $i$ in the source domain to sample $j$ in the target domain, and $a_i$ denotes the binary selection action on the data instance $x_i$.
The optimal solution is to move all the probability mass of the $i$-th instance in $X^e_b$ to its most similar instance in $X^t_b$.² To encourage the data selector to choose source data instances close to the target, the immediate reward is formulated as

$$r^{im}_i = \bar{\mathcal{L}}_{wd}(x^{e*}, x^{t*}) - \mathcal{L}_{wd}(x^e_i, x^{t*}), \quad (8)$$

where $\bar{\mathcal{L}}_{wd}(x^{e*}, x^{t*})$ denotes the averaged distance between all the source and target data. The data selector seeks the optimal solution of $\mathcal{L}_{wd}(x^e_i, x^{t*})$ to maximize the immediate reward. Since using the whole dataset is computationally expensive, we compute this metric at the batch level to speed up training.
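Assuming the nearest-neighbour form derived in Appendix A.1, the batch-level immediate reward can be sketched in NumPy. The reward here (average nearest-neighbour distance minus each instance's own distance) is our reading of the reward definition above, not a verbatim reproduction of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(2)
src = rng.normal(size=(5, 3))   # encoded source batch X^e_b
tgt = rng.normal(size=(4, 3))   # encoded target batch X^t_b
a = np.array([1, 0, 1, 1, 0])   # binary selection actions from the policy

# d(i, j): Euclidean distances between source and target encodings.
d = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)

# Optimal transport collapses to a nearest-neighbour search:
# L_wd(x^e_i, x^{t*}) = a_i * d(i, j*) with j* = argmin_j d(i, j).
nearest = d.min(axis=1)
L_i = a * nearest

# Immediate reward: average source-target distance minus the instance's
# own distance, so instances closer than average earn positive reward.
L_bar = nearest.mean()
r_im = L_bar - nearest
```

By construction the rewards are centred: instances nearer to the target than the batch average are rewarded, farther ones penalized.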
2 Please refer to Appendix for the detailed proof.
Delayed Reward. The delayed reward is based on the evaluation metric and measures the performance difference before and after the TL model updates:

$$r^{d} = \mathcal{L}(y, f_\omega(x)) - \mathcal{L}'(y, f_\omega(x)), \quad (9)$$

where $\mathcal{L}(y, f_\omega(x))$ denotes the evaluation result of the updated model and $\mathcal{L}'(y, f_\omega(x))$ denotes the previous evaluation result. Based on our empirical results, we set $\mathcal{L}$ to target-domain accuracy for classification tasks, and to the correlation coefficient between the predicted score and the ground-truth score for regression tasks.
In contrast to conventional reinforcement learning, our model is updated in batches to improve training efficiency. For batch $b$ in an episode of $B$ batches, the accumulated delayed reward is defined as

$$R^d_b = \gamma^{B-b} \, r^{d}, \quad (10)$$

where $\gamma$ is a discount factor.
Total Reward. The total reward is defined as the combination of the immediate and delayed rewards:

$$R_{tot} = \alpha \, r^{im} + (1 - \alpha) \, R^d_b, \quad (11)$$

where $\alpha$ is the coefficient that balances the contributions of the immediate and delayed rewards.
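One plausible reading of the reward combination, the episode's single delayed reward discounted back over its batches and mixed with per-batch immediate rewards, can be sketched as follows; the exact discounting scheme is an assumption, not taken verbatim from the paper.

```python
import numpy as np

def total_rewards(r_im, r_delay, alpha, gamma):
    """Mix per-batch immediate rewards with the episode's single delayed
    reward, discounted back from the end of the episode (a hedged sketch
    of the accumulated/total reward described above)."""
    B = len(r_im)
    # Earlier batches receive the delayed signal attenuated by gamma.
    R_d = np.array([gamma ** (B - 1 - b) * r_delay for b in range(B)])
    return alpha * np.asarray(r_im) + (1 - alpha) * R_d
```

With α = 1 the selector is guided purely by the Wasserstein-based immediate signal; with α = 0, purely by the discounted delayed signal.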
Optimization. We adopt the policy gradient algorithm (Williams, 1992) to maximize the expected reward $J(\theta)$. The parameter $\theta$ is updated as

$$\theta \leftarrow \theta + \eta \, \frac{1}{K} \sum_{i=1}^{K} R_{tot} \, \nabla_\theta \log \pi_\theta(a_i \mid s_i), \quad (12)$$

where $\eta$ is the learning rate and $K$ is the selected data size. We present the training process in Algorithm 1. Clearly, the Wasserstein-based discriminator serves to (1) help the TL module learn domain-invariant features and provide stable state representations for the RL module, and (2) provide a Wasserstein distance-based metric that serves as an immediate reward signal to guide the RL policy.
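The policy-gradient update reduces to REINFORCE with a shared batch reward. A self-contained NumPy sketch follows, with a toy linear softmax policy standing in for the paper's selector network (the states, actions, and reward value here are all illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy linear policy over {drop, keep}; s_i are state features.
S = rng.normal(size=(6, 4))          # states for K = 6 instances
theta = np.zeros((4, 2))             # policy parameters
probs = softmax(S @ theta)
actions = np.array([1, 0, 1, 1, 0, 1])
R_tot = 0.5                          # total reward for this batch

# REINFORCE (Williams, 1992): grad J = (1/K) * sum_i R_tot *
# grad log pi(a_i | s_i); for a softmax policy, the gradient of
# log pi w.r.t. the logits is (one_hot(a_i) - probs_i).
one_hot = np.eye(2)[actions]
grad_logits = one_hot - probs                  # (K, 2)
grad_theta = S.T @ grad_logits / len(S) * R_tot
theta += 0.1 * grad_theta                      # ascend the expected reward
```

Because the batch reward is positive, one ascent step raises the policy's probability of the actions that were actually taken, which is exactly the credit-assignment behaviour the update is meant to produce.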

Algorithm 1: Wasserstein distance-based Selective Transfer Learning (WSTL)
Require: Source and target domain data $X^e$, $X^t$; validation data $V$
1: Initialize discriminator $f_\phi$, TL module $f_\omega$, and data selector $f_\theta$ with random weights
2: Pre-train discriminator using $X^e$, $X^t$
3: Pre-train TL module using $X^e$, $X^t$
⋮
9: Conduct data selection using Eq. (5)
10: Obtain immediate reward using Eq. (8)
11: Update TL module $f_\omega$ using Eq. (4)
12: Obtain delayed reward using Eq. (10)
13: Obtain total reward using Eq. (11)
14: Update RL module $f_\theta$ using Eq. (12)

Experiments on PI and NLI
We compare our model with the baseline models. As shown in Table 1, we have several observations.
(1) We observe that the proposed WSTL method achieves the best performance and significantly outperforms other methods on both tasks, which shows the superiority and effectiveness of the method.
(2) Due to the domain shift, Src-only performs worse than Tgt-only. The TL method outperforms the Tgt-only model, since leveraging information from the source domain helps the target domain.
(3) Existing instance selection methods such as RTL and MGTL improve performance over TL, which shows that data selection can alleviate negative transfer. WSTL outperforms all competing methods, indicating that the proposed Wasserstein-based discriminator helps both RL and TL to stabilize training and promotes better selection.
(4) The improvement over the two variants of our model, WSTL w/o adv and WSTL w/o RL, also shows the importance of the Wasserstein-based adversarial training and the RL module.
(5) The proposed WSTL method outperforms WSTL-JS, its Jensen-Shannon divergence variant, which shows the effectiveness of the Wasserstein distance metric.
In general, by integrating the TL module, RL module and the Wasserstein-based discriminator in a unified framework, WSTL can significantly outperform the competing methods on all tasks.

Experiments on Review Helpfulness Prediction
Review helpfulness prediction is a regression task that predicts the helpfulness score of a given review. We compare our model with regression baselines that use hand-crafted features, namely STR, UGR, LIWC, INQUIRER, the aspect-based feature ASP, and two groups of ensemble features named Fusion1 and Fusion2 from Chen et al. (2018a). We also compare with Src-only, Tgt-only (Kim, 2014), the vanilla TL method, and two degenerated versions of our model, WSTL w/o adv and WSTL w/o RL, as described in Section 3.2. In addition, we compare with two recently proposed TL methods for the task:
• TL-dd (Chen et al., 2018a): a cross-domain model with auxiliary domain discriminators.
• TL-adv (Liu et al., 2017): a TL method with adversarial training that prevents the shared and private features from interfering with each other.
As shown in Table 2, we have observations similar to those in the experiments on PI and NLI. (1) Traditional methods that do not use deep neural networks, i.e., from STR and UGR to the Fusion methods, cannot perform as well as deep learning methods. This shows that deep learning methods can extract more important semantic features for the task. (2) The source and target domains have similar but different data distributions, since Src-only performs worse than Tgt-only. (3) Transfer learning methods significantly outperform "non-transfer" methods and hand-crafted features, which again confirms the benefit of transferring knowledge from the source domain.

Effects of the Wasserstein Discriminator
To demonstrate the effectiveness of the Wasserstein discriminator, we compare our method with the vanilla RTL method (Qu et al., 2019a).
Distances between Domains. We first compare the distances between the selected data and the target data on the "Watch" task. The Wasserstein distances for the original data, the data selected by RTL, and the data selected by our WSTL are 3.841e-06, 3.782e-06, and 3.436e-06, respectively. Clearly, both WSTL and RTL reduce the distance between the source and target data distributions, while WSTL selects data instances closer to the target.
Feature Visualization. We use t-SNE to demonstrate the proposed method's capacity to minimize the divergence between the source and target distributions. As shown in Figure 2, instances from different domains are closer in Figure 2(b). This corroborates that, by adopting the Wasserstein discriminator, WSTL helps to learn domain-invariant features that effectively reduce domain discrepancy. The detailed feature distributions at different training steps are in the Appendix. Moreover, WSTL improves around 2% over RTL in Table 1 and 1% on average in Table 2, which shows that WSTL learns discriminative features with better results on the target domain.
Convergence and Stability. We compare the convergence and stability of WSTL and RTL on the "Watch" task in review helpfulness prediction. As shown in Figure 3, WSTL converges faster and performs more stably than RTL. The reason is that the Wasserstein distance-based discriminator stabilizes learning and provides domain-invariant representations, together with a Wasserstein distance-based metric, that help the reinforced data selector learn more efficiently. We also find that WSTL performs better than RTL, since the better state representations and reward signals provided by the Wasserstein distance-based domain discriminator enable WSTL to select high-quality source data that improves performance in the target domain.

Parameter Sensitivity Analysis
Effect of different rewards. We test the effect of different rewards on the data selector learning.
As shown in Figure 4(a), both the delayed reward and the immediate reward are helpful for the task, and the best performance is achieved by combining both types of rewards.
We also examine the impact of the hyper-parameter α, which balances the contributions of the immediate and delayed rewards. As shown in Figure 4(b), our method is generally robust to different values of α, showing similar performance across settings.
Effect of Feature Learning. We test the impact of the hyper-parameter β, which controls the balance between domain-invariant learning and discriminative feature learning, using the values β ∈ {0.01, 0.1, 1, 10} in Equation 4. As shown in Figure 5, the performance on the PI task is less sensitive to β than on the NLI task.

Figure 6: The proportion of selected data in which "address" carries the same semantic meaning as in the target domain.

Case Study
To examine whether our model can select desirable samples from the source domain, we design an experiment to examine the behavior of the data selector. In the source NLI data, the term "address" has two meanings: a verb meaning to begin to deal with something, and a noun denoting a location. However, in the target NLI data, the term mostly carries the first meaning. Hence we remove sentences with the second meaning from the target and collect examples containing "address" in both datasets, to examine whether the proposed method can select proper sentences. We compute the proportion of selected source examples in which "address" carries the same semantic meaning as in SciTail.
From Figure 6, we observe that with model training, the data selector can select more instances with the same "address" meaning as in SciTail. This result is insightful as it shows the proposed method can transfer relevant examples to promote positive transfer and ignore irrelevant ones to mitigate negative transfer.

Related Work
Transfer learning (TL). TL has been widely studied in various applications (McCann et al., 2017; Deng et al., 2009; Yosinski et al., 2014). Due to domain differences, a vanilla transfer learning method may suffer from negative transfer. There are generally two lines of study addressing this problem. The first is feature-based methods, which aim to locate a common feature space that reduces the differences between the source and target domains (Shen et al., 2017). However, the capacity of the shared space can be consumed by unnecessary features. The second category is instance-based methods, which re-weight the source samples so that the source and target domains share a similar data distribution (Chen et al., 2011; Huang et al., 2007; Ruder and Plank, 2017). The TL module is typically treated as a sub-module of the data selection framework (Ruder and Plank, 2017); it therefore needs to be retrained repeatedly to provide sufficient updates to the selection framework, which may lead to long training times. Recent studies (Qu et al., 2019a; Wang et al., 2019) consider RL-based instance selection methods that jointly train the data selector and the TL model. However, the training of the TL module struggles to remain stable, and the misleading signals from the TL model inevitably increase the difficulty of data selection. Our method employs the Wasserstein discriminator to help both the TL and RL modules. It provides domain-invariant features that serve as states for the RL policy and helps to improve the TL module via adversarial training. The Wasserstein discriminator also provides immediate rewards to guide the RL policy.
Different from several domain adaptation methods proposed for sentiment classification (Qu et al., 2019b; Du et al., 2020; Xue et al., 2020; Zhang et al., 2019; He et al., 2018), where labels in the target domain are not available, both source and target labels are available in our transfer learning setting. In our setting, we seek to leverage data-sufficient domains to help target domains with insufficient labeled data. Besides, these methods are designed for sentiment classification, while our method applies more generally to various tasks such as PI and NLI.
Wasserstein Distance. The Wasserstein distance (Panaretos and Zemel, 2019) is a metric based on the theory of optimal transport. It gives a natural measure of the distance between two probability distributions. The Wasserstein metric has been introduced to alleviate the vanishing gradient and mode collapse issues of the original GAN (Goodfellow et al., 2014). Chen et al. (2018b) propose to minimize the Wasserstein distance between different domains for cross-lingual sentiment classification. Shen et al. (2017) adopt the Wasserstein distance in representation learning for domain adaptation; however, the learned representations are contaminated by misleading features and suffer from feature redundancy. Yu et al. (2020) introduce the Wasserstein distance as a regularizer to improve sequence representations. Inspired by its success in various applications, we introduce the Wasserstein distance to selective transfer learning.

Conclusions
To alleviate the negative transfer issue, we propose a Wasserstein Selective Transfer Learning (WSTL) method that builds a Wasserstein discriminator to maximize the empirical distance between the selected source data and the target domain data. The TL module is trained to minimize the estimated Wasserstein distance in an adversarial manner, and the discriminator provides immediate rewards which, coupled with the delayed rewards from the TL module, guide the reinforced data selector. Extensive experiments on three real-world datasets show that the proposed method significantly outperforms the competing methods.
A Appendices.

A.1 Proof for Equation 7
The optimal solution is to move all the probability mass of each instance in $X^e_b$ to its most similar instance in $X^t_b$. Precisely, an optimal matrix $T^*$ is defined as

$$T^*_{ij} = \begin{cases} a_i & \text{if } j = j^*, \text{ where } j^* = \arg\min_j d(i, j), \\ 0 & \text{otherwise.} \end{cases}$$

Below we derive the solution.
PROOF. Let $j^* = \arg\min_j d(i, j)$. For any feasible plan $T$ with $\sum_j T_{ij} = a_i$, we have

$$\sum_j T_{ij} \, d(i, j) \ge \sum_j T_{ij} \, d(i, j^*) = a_i \, d(i, j^*).$$

Therefore, $T^*$ must yield a minimum objective value. Thus, for each instance vector $x_i$ in $X^e_b$, we can achieve the minimum objective value by a nearest-neighbor search for $j^* = \arg\min_j d(i, j)$, and the distance metric is $\mathcal{L}_{wd}(x^e_i, x^{t*}) = a_i \, d(i, j^*)$.
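The nearest-neighbor optimality claim is easy to corroborate numerically: any feasible transport row with total mass $a_i$ costs at least $a_i \cdot \min_j d(i, j)$. A small NumPy check against random feasible plans (distances and plans here are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(4)
d = rng.uniform(1.0, 5.0, size=4)   # distances d(i, j) for one source instance
a_i = 1.0                            # the instance is selected

# Nearest-neighbour plan T*: put all a_i mass on j* = argmin_j d(i, j).
j_star = int(np.argmin(d))
cost_star = a_i * d[j_star]

# Any feasible plan with sum_j T_ij = a_i costs sum_j T_ij * d(i, j)
# >= a_i * min_j d(i, j); verify against random feasible plans.
for _ in range(100):
    T = rng.uniform(size=4)
    T = a_i * T / T.sum()            # normalize to satisfy the constraint
    assert T @ d >= cost_star - 1e-12
```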
A.2 Experiment Settings
PI task. This is a typical text matching task widely used in dialogue systems (Qu et al., 2019a). The task is to examine the relationship, i.e., paraphrase or not, between two input text sequences. We treat the Quora question pairs⁵ as the source domain and the AnalytiCup⁶ dataset as the target.
Review helpfulness prediction. This is a text mining task that estimates the helpfulness score of a given review. Due to the high volume of reviews on e-commerce sites, it is an important task that draws increasing attention from both academia and industry (Martin and Pu, 2014). We use reviews from five categories of products in the Amazon review dataset (McAuley and Leskovec, 2013). The data from the Electronics domain (the largest) serve as the source domain, while the remaining four domains are treated as target domains. Data statistics are summarized in Table 3.
The experiments are conducted on a Linux server equipped with an Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz and 8 NVIDIA V100-SXM2-16GB GPUs. We implement our model in TensorFlow and train with the Adam optimizer (Kingma and Ba, 2014). The discriminator has a hidden layer of 128 nodes. The learning rate for the discriminator is 1e-4, and the number of discriminator training steps is set to 5 for fast computation with a sufficient optimization guarantee. The penalty coefficient λ is set to 10 as suggested in (Gulrajani et al., 2017).
For both the PI and NLI tasks, the size of the hidden layers of the decomposable attention model is 256. The max sequence length is set to 50. Word embeddings are initialized with GloVe word vectors (Pennington et al., 2014) and are trainable. The initial learning rate is set to 0.001. The transfer learning model is pre-trained for 50 iterations before the reinforced data selector is applied. For the review helpfulness prediction task, the lookup table is also initialized with pre-trained GloVe vectors. For the CNN, the activation function is ReLU and the channel size is set to 128. Multiple filters are used with window sizes l ∈ {2, 3, 4, 5}. We use Adam as the optimizer. Following previous work (Chen et al., 2018a), ten-fold cross-validation is performed for all experiments, and all results are evaluated by the correlation coefficient between the predicted helpfulness score and the ground-truth score computed by the "a of b approach" from the dataset.

A.3 Feature Visualization
To demonstrate the transferability of the selected features, we use t-SNE to visualize the selected feature representations from the source and target domains. We choose the "Watch" task in review helpfulness prediction as an example. In Figure 7, red points denote samples from the source domain and green points denote samples from the target domain. We observe that as training proceeds, the red and green points get closer, which indicates that our model selects useful source data instances close to the target domain. This shows that adopting the Wasserstein distance in adversarial training helps to learn domain-invariant features, and such features effectively alleviate negative transfer.