Latent-Optimized Adversarial Neural Transfer for Sarcasm Detection

The existence of multiple datasets for sarcasm detection prompts us to apply transfer learning to exploit their commonality. The adversarial neural transfer (ANT) framework utilizes multiple loss terms that encourage the source-domain and the target-domain feature distributions to be similar while optimizing for domain-specific performance. However, these objectives may be in conflict, which can lead to optimization difficulties and sometimes diminished transfer. We propose a generalized latent optimization strategy that allows different losses to accommodate each other and improves training dynamics. The proposed method outperforms transfer learning and meta-learning baselines. In particular, we achieve 10.02% absolute performance gain over the previous state of the art on the iSarcasm dataset.

A challenge specific to sarcasm detection is the difficulty of acquiring ground-truth annotations. Human-annotated datasets (Filatova, 2012; Riloff et al., 2013; Van Hee et al., 2018; Oprea and Magdy, 2020) usually contain only a few thousand texts, resulting in many small datasets. In comparison, automatic data collection using distant supervision signals like hashtags (Ptáček et al., 2014; Bamman and Smith, 2015; Joshi et al., 2015) yields substantially larger datasets. Nevertheless, the automatic approach also leads to label noise. For example, Oprea and Magdy (2020) found that nearly half of the tweets with sarcasm hashtags in one dataset are not sarcastic.
The existence of diverse datasets and data collection methods prompts us to exploit their commonality through transfer learning. Specifically, we transfer knowledge learned from large and noisy datasets to improve sarcasm detection on small human-annotated datasets that serve as effective performance benchmarks.
Adversarial neural transfer (ANT) (Ganin and Lempitsky, 2015; Liu et al., 2017; Kim et al., 2017; Kamath et al., 2019) employs an adversarial setup where the network learns to make the shared feature distributions of the source domain and the target domain as similar as possible, while simultaneously optimizing for domain-specific performance. However, as the domain-specific losses promote the use of domain-specific features, these training objectives may compete with each other implicitly. This leads to optimization difficulties and potentially degenerate cases where the domain-specific classifiers ignore the shared features and no meaningful transfer occurs between domains.
To cope with this issue, we propose Latent-Optimized Adversarial Neural Transfer (LOANT). The latent optimization strategy can be understood through analogies to one-step look-ahead during gradient descent and to Model-Agnostic Meta-Learning (Finn et al., 2017). By forcing domain-specific losses to accommodate the negative domain discrimination loss, it improves training dynamics (Balduzzi et al., 2018).
With LOANT, we achieve a 10.02% absolute improvement over the previous state of the art on the iSarcasm dataset (Oprea and Magdy, 2020) and a 3.08% improvement on the SemEval-18 dataset (Van Hee et al., 2018). Over four sets of transfer learning experiments, latent optimization on average brings a 3.42% improvement in F-score over traditional adversarial neural transfer and a 4.83% improvement over a similar training strategy from Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017). In contrast, traditional ANT brings an average of only 0.9% F-score improvement over non-adversarial multi-task learning. The results demonstrate that LOANT can effectively perform knowledge transfer for the task of sarcasm detection and suggest that the proposed latent optimization strategy enables collaboration among the ANT losses during optimization.
Our contributions can be summarized as follows: 1. Inspired by the existence of multiple small sarcasm datasets, we propose to use transfer learning to bridge dataset differences. To the best of our knowledge, this is the first study of transfer learning between different sarcasm detection datasets.
2. We propose LOANT, a novel latent-optimized adversarial neural transfer model for cross-domain sarcasm detection. By conducting stochastic gradient descent (SGD) with one-step look-ahead, LOANT outperforms traditional adversarial neural transfer, multi-task learning, and meta-learning baselines, and establishes a new state-of-the-art F-score of 46.41%. The code and datasets are available at https://github.com/guoxuxu/LOANT.

Sarcasm Detection
Acquiring large and reliable datasets has been a persistent challenge for computational detection of sarcasm. Due to the cost of annotation, manually labeled datasets (Walker et al., 2012; Riloff et al., 2013; Wallace et al., 2014; Abercrombie and Hovy, 2016; Oraby et al., 2016; Van Hee et al., 2018; Oprea and Magdy, 2020) typically contain only a few thousand texts. Automatic crawling (Ptáček et al., 2014; Bamman and Smith, 2015; Joshi et al., 2015; Khodak et al., 2018b) using hashtags or markers yields substantially more texts, but the results are understandably noisier. As a case study, after examining the dataset of Riloff et al. (2013), Oprea and Magdy (2020) found that nearly half of the tweets with sarcasm hashtags are not sarcastic. In this paper, we evaluate performance on the manually labeled datasets, which are relatively clean and can serve as good benchmarks, and transfer the knowledge learned from automatically collected datasets. Traditional sarcasm detection includes methods based on rules (Tepperman et al., 2006) and on lexical (Kreuz and Caucci, 2007) and pragmatic patterns (González-Ibánez et al., 2011). Context-aware methods (Rajadesingan et al., 2015; Bamman and Smith, 2015) make use of contexts, such as the author, the audience, and the environment, to enrich feature representations.
Deep learning techniques for sarcasm detection employ convolutional networks (Ghosh and Veale, 2016), recurrent neural networks (Zhang et al., 2016; Felbo et al., 2017; Wu et al., 2018), attention (Tay et al., 2018), and pooling (Xiong et al., 2019) operations. Amir et al. (2016) incorporate historic information for each Twitter user. Cai et al. (2019) consider the images that accompany tweets, and Mishra et al. (2017) utilize readers' gaze patterns. To the best of our knowledge, no prior work has explored transfer learning between different sarcasm datasets.
Theoretical analysis (Ben-David et al., 2010) indicates that a key factor for the success of transfer is to reduce the divergence between the feature spaces of the domains. Ganin and Lempitsky (2015) propose to minimize domain differences via a GAN-like setup, where a domain discriminator network learns to distinguish between features from two domains and a feature extraction network learns to produce indistinguishable features, which are conducive to transfer learning.
However, as shown in our experiments, adding the domain discriminator to MTL does not always result in improved performance. We attribute this to the implicit competition between the negative domain discrimination loss and the domain-specific losses, which causes difficulties in optimization. In this paper, we improve the training dynamics of adversarial transfer learning using latent optimization on BERT features.

Meta-Learning and Latent Optimization
The idea of coordinating the gradient updates of different, competing losses using gradient descent with look-ahead has been explored in the Latent-Optimized Generative Adversarial Network (LOGAN) (Wu et al., 2019a,b), Symplectic Gradient Adjustment (Balduzzi et al., 2018; Gemp and Mahadevan, 2019), Unrolled GAN (Metz et al., 2016), Model-Agnostic Meta-Learning (Finn et al., 2017), and the extragradient method (Azizian et al., 2020). The difference between LOGAN and the other techniques is that LOGAN computes the derivative of the randomly sampled latent input, whereas the other methods compute second-order derivatives in the model parameter space.
In this paper, we generalize latent optimization from GANs to multi-task learning, where the adversarial loss is complemented by domain-specific task losses. In addition, we apply latent optimization on the output of the BERT module, which differs from the optimization of the random latent variable in LOGAN. As large pretrained masked language models (PMLMs) gain prominence in NLP, latent optimization avoids gradient computation on the parameters of enormous PMLMs, providing reduction in running time and memory usage.

The LOANT Method
In supervised transfer learning, we assume labeled data for both the source domain and the target domain are available. The source domain dataset D_s comprises data points of the form (x_s, y_s) and the target domain dataset D_t comprises data points of the form (x_t, y_t). The labels y_s and y_t are one-hot vectors. The task of supervised cross-domain sarcasm detection can be formulated as learning a target-domain function f_t(x_t) that predicts correct labels for unseen x_t.

Our model follows the shared-private design of adversarial neural transfer (Liu et al., 2017; Kamath et al., 2019; Kim et al., 2017). We use a large pretrained neural network, BERT (Devlin et al., 2019), as the sentence encoder, though the architecture is not tied to BERT and can use other pretrained encoders. We denote the parameters of the BERT encoder as w_b, and its output for data in the source domain and the target domain as z_s ∈ R^D and z_t ∈ R^D respectively. We denote this encoder operation as

    z_s = E(x_s; w_b),    z_t = E(x_t; w_b).

On top of these outputs, we apply domain-specific dense layers to create domain-specific features v_s, v_t and shared dense layers to create shared features u_s, u_t. We use w_s, w_t, and w_sh to denote the parameters for the source dense layers, the target dense layers, and the shared dense layers.
Figure 2: Schematic of the latent optimization strategy. The solid black arrows indicate the forward pass and the dotted red arrows indicate the backward pass.

The concatenation of features [v_s, u_s] is fed to the source-domain classifier, parameterized by θ_s; [v_t, u_t] is fed to the target-domain classifier, parameterized by θ_t. The two classifiers categorize the tweets into sarcastic and non-sarcastic and are trained using cross-entropy. For reasons that will become apparent later, we make explicit the reliance on z_s and z_t:

    L_s(z_s) = − Σ_i y_s,i log ŷ_s,i,    (1)
    L_t(z_t) = − Σ_i y_t,i log ŷ_t,i,    (2)
where ŷ_s and ŷ_t are the predicted labels and i is the index of the vector components. Simultaneously, the domain discriminator, parameterized by θ_d, learns to distinguish the features u_s and u_t as coming from different domains. It is trained to minimize the domain classification loss

    L_d(z_s, z_t) = − log D(u_s) − log (1 − D(u_t)),    (3)

where D(·) is the probability the discriminator assigns to the source domain. Through the use of the gradient reversal layer, the shared dense layers and the feature encoder maximize the domain classification loss, so that the shared features u_s and u_t become indistinguishable and conducive to transfer learning. In summary, the network weights w_b, w_s, w_t, w_sh, θ_s, θ_t are trained to minimize the following joint loss,

    L(z_s, z_t) = L_s(z_s) + L_t(z_t) − L_d(z_s, z_t),    (4)

whereas θ_d is trained to minimize L_d(z_s, z_t).
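As a concrete illustration, the forward pass and the three ANT losses can be sketched in PyTorch with a gradient reversal layer. This is a minimal sketch: the layer names, sizes, and random inputs are illustrative stand-ins, not the released implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

D = 768  # hidden size of BERT-base

# Hypothetical stand-ins for the dense layers, classifiers, and discriminator.
shared = nn.Sequential(nn.Linear(D, D), nn.Tanh())
src_private = nn.Sequential(nn.Linear(D, D), nn.Tanh())
tgt_private = nn.Sequential(nn.Linear(D, D), nn.Tanh())
src_clf = nn.Linear(2 * D, 2)
tgt_clf = nn.Linear(2 * D, 2)
discriminator = nn.Linear(D, 2)
xent = nn.CrossEntropyLoss()

def ant_losses(z_s, z_t, y_s, y_t):
    u_s, u_t = shared(z_s), shared(z_t)
    v_s, v_t = src_private(z_s), tgt_private(z_t)
    L_s = xent(src_clf(torch.cat([v_s, u_s], dim=1)), y_s)
    L_t = xent(tgt_clf(torch.cat([v_t, u_t], dim=1)), y_t)
    # Domain labels: 0 = source, 1 = target. The reversal layer lets the
    # discriminator minimize L_d while the encoder side maximizes it.
    logits = discriminator(GradReverse.apply(torch.cat([u_s, u_t], dim=0)))
    dom = torch.cat([torch.zeros(len(u_s)), torch.ones(len(u_t))]).long()
    L_d = xent(logits, dom)
    return L_s, L_t, L_d

# Random stand-ins for encoder outputs and labels.
z_s, z_t = torch.randn(4, D), torch.randn(4, D)
y_s, y_t = torch.randint(0, 2, (4,)), torch.randint(0, 2, (4,))
L_s, L_t, L_d = ant_losses(z_s, z_t, y_s, y_t)
# One backward call covers all three losses; the GRL handles the sign flip.
(L_s + L_t + L_d).backward()
```

Because the reversal layer negates the gradient of L_d on its way into the shared layers, this single backward call realizes the minimax training that the joint loss describes.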
It is worth noting that the effects of the three loss terms in Eq. 4 on the shared parameters w_sh and w_b may compete with each other. This is because optimizing sarcasm detection in one domain encourages the network to extract domain-specific features, whereas the domain discrimination loss constrains the network to avoid such features. It is possible for the competition to result in degenerate scenarios. For example, the shared features u_s and u_t may become indistinguishable but also uncorrelated with the labels y_s and y_t, or the domain-specific classifiers may ignore the shared features u_s and u_t, so that no transfer happens. To cope with this issue, we introduce a latent optimization strategy that forces the domain-specific losses to accommodate the domain discrimination loss.

Latent Representation Optimization
We now introduce the latent representation optimization strategy. First, we perform one step of stochastic gradient descent on −L_d with respect to the encoded features z_s and z_t with learning rate γ,

    z_s' = z_s − γ ∂(−L_d(z_s, z_t))/∂z_s,    (5)
    z_t' = z_t − γ ∂(−L_d(z_s, z_t))/∂z_t.    (6)

We emphasize that this is a descent step because we are minimizing −L_d. After that, we use the updated z_s' and z_t' in the computation of the losses. The new joint objective hence becomes

    L_LO = L_s(z_s') + L_t(z_t') − L_d(z_s', z_t'),    (7)

which is optimized using regular stochastic gradient descent (SGD) on w_b, w_s, w_t, w_sh, θ_s, and θ_t.
Here we show the general case of gradient computation. Consider any weight vector w in the neural network. Equations 5 and 6 introduce two intermediate variables z_s' and z_t', which are functions of the model parameter w. Therefore, we perform SGD using the following total derivative

    dL_LO/dw = ∂L_LO/∂w + (∂L_LO/∂z_s')(dz_s'/dw) + (∂L_LO/∂z_t')(dz_t'/dw),

where

    dz'/dw = ∂z/∂w + γ ∂²L_d(z)/∂z∂w.

For every network parameter other than the encoder weight w_b, ∂z/∂w is zero. The second-order derivative ∂²L_d(z)/∂z∂w is difficult to compute due to the high dimensionality of w. Since γ is usually very small, we adopt a first-order approximation and directly set the second-order derivative to zero. Letting φ_s = [w_s, θ_s] and φ_t = [w_t, θ_t], the total derivatives for all network parameters become

    dL_LO/dφ_s = ∂L_LO/∂φ_s,    dL_LO/dφ_t = ∂L_LO/∂φ_t,    dL_LO/dw_sh = ∂L_LO/∂w_sh,
    dL_LO/dw_b ≈ ∂L_LO/∂w_b + (∂L_LO/∂z_s')(∂z_s/∂w_b) + (∂L_LO/∂z_t')(∂z_t/∂w_b).

More details can be found in Appendix A. Fig. 2 illustrates the latent optimization process. Algorithm 1 shows the LOANT algorithm.

Algorithm 1: Training of LOANT
Input: source data (x_s, y_s), target data (x_t, y_t), learning rate γ
Initialize model parameters w
repeat
    Sample N batches of data pairs
    Encode each pair of batches to obtain z_s and z_t
    Update the latent features with Eqs. 5 and 6
    Compute L_LO and update the model parameters by SGD
until the maximum training epoch
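The inner update of Eqs. 5 and 6 is straightforward with automatic differentiation. The sketch below uses a toy stand-in for −L_d (the squared gap between the domain means) and illustrative dimensions; in the real model −L_d comes from the discriminator. Detaching the features first realizes the first-order approximation, since no second-order terms flow back through the encoder.

```python
import torch

gamma = 0.1  # inner-step learning rate (illustrative value)

def latent_step(z_s, z_t, neg_L_d):
    """One SGD step on -L_d with respect to the latent features (Eqs. 5, 6)."""
    z_s = z_s.detach().requires_grad_(True)
    z_t = z_t.detach().requires_grad_(True)
    g_s, g_t = torch.autograd.grad(neg_L_d(z_s, z_t), (z_s, z_t))
    return z_s - gamma * g_s, z_t - gamma * g_t

# Toy stand-in for -L_d: the discriminator loss L_d is large when the two
# domains are hard to tell apart, so -L_d here is the squared distance
# between the batch means of the two domains.
def neg_L_d(z_s, z_t):
    return (z_s.mean(0) - z_t.mean(0)).pow(2).sum()

z_s, z_t = torch.randn(4, 8), torch.randn(4, 8) + 2.0
before = neg_L_d(z_s, z_t).item()
z_s2, z_t2 = latent_step(z_s, z_t, neg_L_d)
after = neg_L_d(z_s2, z_t2).item()
# The one-step descent on -L_d shrinks the domain gap: after < before.
```

The domain-specific losses are then evaluated on z_s2 and z_t2, so their gradients are taken at the looked-ahead features rather than the originals.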

Understanding LOANT
To better understand the LOANT algorithm, we relate LOANT to the extragradient technique and Model-Agnostic Meta Learning (Finn et al., 2017). The vanilla gradient descent (GD) algorithm follows the direction along which the function value decreases the fastest. However, when facing an ill-conditioned problem like the one in Fig. 3, GD is known to exhibit slow convergence because the local gradients are close to being orthogonal to the direction of the local optimum.
For comparison with LOANT, we consider the extragradient (EG) method (Korpelevich, 1976; Azizian et al., 2020), which uses the following update rule when optimizing the function f(w) with respect to w:

    w ← w − η d f(w − γ ∂f/∂w) / dw.

Similar to LOANT, we can adopt a first-order approximation to EG if we set the Hessian term in the total derivative to zero. Instead of optimizing the immediate function value f(w), this method optimizes f(w − γ ∂f/∂w), which is the function value after one more GD step. This can be understood as looking one step ahead along the optimization trajectory. In the contour diagrams of Fig. 3, we show the optimization of a 2-dimensional quadratic function. This simple example showcases how the ability to look one step ahead can improve optimization in pathological loss landscapes. We motivate the nested optimization of LOANT by drawing an analogy between EG and LOANT.
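As an illustration of the look-ahead, the sketch below runs first-order extragradient against vanilla GD on an ill-conditioned 2-D quadratic. The matrix and learning rates are illustrative choices, not the values used in the paper's figure.

```python
import numpy as np

# f(w) = 0.5 w^T A w with an ill-conditioned A: the steep axis dominates
# the gradient, which destabilizes plain GD at larger learning rates.
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w

def gd(w, eta, steps):
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def extragradient(w, eta, gamma, steps):
    for _ in range(steps):
        w_ahead = w - gamma * grad(w)   # one-step look-ahead
        w = w - eta * grad(w_ahead)     # update with the look-ahead gradient
    return w

w0 = np.array([1.0, 1.0])
w_gd = gd(w0, 0.1, 50)                    # eta = 0.1: |1 - 0.1 * 25| > 1, GD diverges
w_eg = extragradient(w0, 0.1, 0.03, 50)   # same eta stays stable with look-ahead
```

On the steep axis, the look-ahead damps the gradient before it is applied, which is why EG tolerates a learning rate at which vanilla GD blows up.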
It is worth noting that LOANT differs from the EG update rule in important ways. Specifically, in EG the inner GD step and the outer GD step are performed on the same function f (·), whereas LOANT performs the inner step on L d and the outer step on L s or L t .
For a similar idea with multiple losses, we turn to MAML (Finn et al., 2017). In MAML, there are K tasks with losses L_1, ..., L_k, ..., L_K. On every task, we perform a one-step SGD update to the model parameter w ∈ R^L,

    w_k' = w − γ ∂L_k/∂w.

After going through the K tasks, the actual update to w is calculated using the parameters w_k',

    w ← w − η Σ_k ∂L_k(w_k')/∂w.    (16)

Utilizing the idea of look-ahead, in MAML we update w so that subsequent optimization on any single task or combination of tasks would achieve good results. Adversarial neural transfer has three tasks: the source-domain and target-domain classifications and the negative discriminator loss. The updates performed by LOANT in Eqs. 5 and 6 are similar to MAML's look-ahead update in Eq. 16. Specifically, when we update model parameters using the gradient from the total loss L_LO, we prepare for the next descent step on −L_d. Therefore, LOANT can be understood as forcing domain-specific losses to accommodate the domain discrimination loss and mitigating their competition.

Figure 3: Optimization trajectories on a 2-dimensional quadratic function. (a) Vanilla gradient descent, which exhibits a zigzag trajectory (η = 0.025). (c) Full-Hessian extragradient, which finds a direct path to the local minimum, enabling a large learning rate (η = 0.1).
LOANT differs from MAML since, in the inner update, LOANT updates the sentence-level features z_s and z_t instead of the model parameters w. As z_s and z_t are usually of much smaller dimensions than w, this leads to accelerated training and a reduced memory footprint. For example, in the BERT-base model (Devlin et al., 2019), L is 110 million and D is 768. Within the regular range of batch size B, BD ≪ L. In the experiments, we verify the benefits of LOANT in terms of accuracy and time and space complexity.
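A quick sanity check of the BD ≪ L claim for BERT-base (the parameter count is approximate):

```python
# Values updated by LOANT's inner step (B x D latents) versus the number
# of model parameters a MAML-style inner step would have to touch.
L = 110_000_000        # approximate parameter count of BERT-base
B, D = 128, 768        # batch size used in the paper and hidden size
latent_values = B * D  # 128 * 768 = 98,304
ratio = latent_values / L
```

The inner update thus touches roughly a thousandth of the quantities that a parameter-space look-ahead would, which is where the time and memory savings come from.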

Datasets
We conduct four cross-domain sarcasm detection experiments by transferring from an automatically collected dataset to a manually annotated dataset. The two automatically collected datasets are Ptáček (Ptáček et al., 2014) and Ghosh (Ghosh and Veale, 2016), which treat tweets having particular hashtags such as #sarcastic, #sarcasm, or #not as sarcastic and other tweets as not sarcastic. We crawled the Ptáček dataset using the NLTK API according to the tweet ids published online.
The two manually annotated datasets are SemEval-18 (Van Hee et al., 2018) and iSarcasm (Oprea and Magdy, 2020). SemEval-18 consists of both sarcastic and ironic tweets labeled by third-party annotators and thus serves as a benchmark for perceived sarcasm detection. The iSarcasm dataset contains tweets written by participants of an online survey and thus is an example of intended sarcasm detection. Table 1 summarizes the statistics of the four datasets. The SemEval-18 dataset is balanced while the iSarcasm dataset is imbalanced. The two source datasets are more than ten times the size of the target datasets. For all datasets, we use the predefined test set and use a random 10% split of the training set as the development set.
We preprocessed all datasets using the lexical normalization tool for tweets from Baziotis et al. (2017). We cleaned the four datasets by dropping all duplicate tweets within and across datasets, and trimmed the texts to a maximum length of 100. To deal with class imbalance, we performed upsampling on the target-domain datasets, so that both the sarcastic and non-sarcastic classes have the same size as in the source-domain datasets.
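The rebalancing step can be sketched as follows. This is a minimal sketch of duplication-based upsampling; the paper's exact resampling procedure may differ.

```python
import random

def upsample(examples, target_size, seed=0):
    """Duplicate randomly drawn examples until the class has target_size items."""
    rng = random.Random(seed)
    out = list(examples)
    while len(out) < target_size:
        out.append(rng.choice(examples))
    return out

# Hypothetical minority-class examples as (text, label) pairs.
sarcastic = [("tweet a", 1), ("tweet b", 1), ("tweet c", 1)]
balanced = upsample(sarcastic, 12)
```

Applying this to each target-domain class yields classes of equal size, so every training batch can pair a source batch with a target batch of the same shape.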

Baselines
We compare LOANT with several competitive single-task and multi-task baselines.
MIARN (Tay et al., 2018): A state-of-the-art short-text sarcasm detection model that ranked first on the iSarcasm dataset. The model is a co-attention-based LSTM that uses word embeddings pretrained on Twitter data.
Dense-LSTM (Wu et al., 2018): A state-of-the-art single-task sarcasm detection model that ranked first on the SemEval-18 dataset. The model is a densely connected LSTM network consisting of four Bi-LSTM layers, with word embeddings pretrained on two Twitter datasets.

BERT:
We finetune the BERT model (Devlin et al., 2019) with an additional simple classifier directly on the target dataset.
S-BERT is a two-stage finetuning of the BERT model. We first finetune BERT on the source dataset and the best model is selected for further fine-tuning on the target dataset.

MTL:
We implemented a multi-task learning (MTL) model, which has the same architecture as LOANT except that the domain discriminator is removed. We use BERT as the shared text encoding network.

MTL+LO:
In this baseline, we applied latent optimization to MTL. As MTL does not have the adversarial discriminator, we use the domain-specific losses to optimize the latent representations:

    z_s' = z_s − γ ∂L_s(z_s)/∂z_s,    z_t' = z_t − γ ∂L_t(z_t)/∂z_t.

We use the above to replace Equations 5 and 6 and keep the rest of the training steps unchanged. This model is compared against MTL to study the effects of LO in non-adversarial training for cross-domain sarcasm detection.
ANT: This is the conventional adversarial neural transfer model with the same architecture as LOANT. The only difference is that we do not apply latent optimization. For fair comparisons, we use BERT as the text encoder.
ANT+MAML: In Section 3.3, we discussed the similarity between LO and MAML. Therefore, we create a baseline that uses a MAML-like strategy for encouraging the collaboration of different loss terms. Instead of optimizing the latent representations z_s and z_t, we first take an SGD step in the parameter space of w_b,

    w_b' = w_b − γ ∂(−L_d)/∂w_b.

After that, we use w_b' to compute the gradients used in the actual updates to all model parameters, including w_b.
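The parameter-space look-ahead of ANT+MAML can be sketched as follows, using a toy linear encoder in place of BERT and placeholder losses; all names here are illustrative, and the detached inner gradient corresponds to a first-order approximation.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(8, 8)  # toy stand-in for BERT's w_b
gamma = 0.01
x = torch.randn(4, 8)

# Inner step: w_b' = w_b - gamma * d(-L_d)/d(w_b). The gradient is detached,
# so the outer update treats the look-ahead offset as a constant.
neg_L_d = -encoder(x).pow(2).mean()  # placeholder for -L_d
grads = torch.autograd.grad(neg_L_d, list(encoder.parameters()))
w_prime, b_prime = [p - gamma * g.detach()
                    for p, g in zip(encoder.parameters(), grads)]

# Outer step: evaluate a task loss at w_b' and backpropagate to w_b itself.
task_loss = F.linear(x, w_prime, b_prime).pow(2).mean()  # placeholder for L_s + L_t
task_loss.backward()
```

Note that the inner step here copies every encoder parameter, which is why this baseline roughly doubles memory relative to LOANT's latent-space look-ahead.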

Experimental Settings
Model Settings. For all models using the BERT text encoder, we use the uncased version of the BERT-base model and take the 768-dimensional output from the last layer corresponding to the [CLS] token to represent a sentence. The BERT parameters are always shared between domains. For other network components, we randomly initialize the dense layers and classifiers. To minimize the effect of different random initializations, we generate the same set of initial parameters for each network component and use them across all baselines wherever possible. The source dense layer, the shared dense layer, and the target dense layer are single linear layers with input size of 768 and output size of 768 followed by the tanh activation. The classifier in all models consists of two linear layers. The first linear layer has input size of 768×2 (taking both shared and domain-specific features) and output size of 768 followed by the ReLU activation. The second linear layer has input size 768 and output size 2 for binary classification. After that we apply the softmax operation. More details can be found in Appendix B.
Training Setting. We optimize all models using Adam (Kingma and Ba, 2014) with a batch size of 128. We tune the learning rate (LR) on the development set from 2e-5 to 1e-4 in increments of 2e-5. To objectively assess the effects of latent optimization (LO), we first find the best LR for the base models such as ANT and MTL. After that, with the best LR unchanged, we apply LO to ANT and MTL. We use the cosine learning rate schedule for all models. All models are trained for 5 epochs on Nvidia V100 GPUs with 32GB of memory in mixed precision. Due to the large model size and pretrained weights of BERT, 5 epochs are sufficient for convergence.
Table 2: Performance on the sarcastic class reported by single-task and multi-task models on the same test sets. The best F-score in each of the four groups of transfer learning experiments is in bold. The best single-task learning results are underlined. † denotes results reported in (Wu et al., 2018) and ‡ results reported in (Oprea and Magdy, 2020).

Evaluation Metrics. Following Wu et al. (2018), Van Hee et al. (2018), and Oprea and Magdy (2020), we select and compare models using the F-score on the sarcastic class in each dataset. We additionally
report the corresponding Recall and Precision. In all our experiments, we use the development set for model selection and report their performance on the test set. To evaluate the efficiency of LOANT versus MAML-based training, we also compare their required GPU memory and average training time in each epoch. We compare models on the target domain datasets. Additional multi-domain performance can be found in Appendix C.

Comparison with the State of the Art
We compare LOANT with state-of-the-art methods on the SemEval-18 dataset (Van Hee et al., 2018) and the iSarcasm dataset (Oprea and Magdy, 2020). Table 2 presents the test performance of LOANT and all baseline models. Our LOANT model consistently outperforms all single-task baselines by large margins. In particular, LOANT outperforms MIARN by 10.02% on iSarcasm (Oprea and Magdy, 2020), whereas fine-tuned BERT scores 1.48% lower than MIARN. On SemEval-18, fine-tuned BERT achieves better test performance than the other four single-task baselines. The results indicate that fine-tuning BERT, a popular baseline, does not always outperform traditional LSTM networks specifically designed for the task. We hypothesize that the large BERT model can easily overfit the small datasets used, which highlights the challenge of sarcasm detection.

Transfer Learning Performance
The middle and bottom sections of Table 2 present the test performance of six transfer learning models (S-BERT, MTL, ANT, MTL+LO, ANT+MAML, and LOANT) under four groups of transfer learning experiments. These models generally outperform the single-task models, demonstrating the importance of transfer learning. Among these, we have the following observations.
Effects of the Domain Discriminator. The performance differences between MTL and ANT can be explained by the addition of the domain discriminator, which encourages the shared features under the source domain and the target domain to have the same distributions. In the four pairs of experiments, ANT marginally outperforms MTL by an average of 0.9% F-score. In the Ptáček → SemEval-18 experiment, the domain discriminator causes F-score to decrease by 0.56%. Overall, the benefits of the adversarial discriminator to transfer learning appear to be limited. As discussed earlier, the competition between the domain-specific losses and the negative domain discrimination loss may have contributed to the ineffectiveness of ANT.
Effects of Latent Optimization. We can observe the effects of LO by comparing ANT with LOANT and comparing MTL with MTL+LO. Note that in these experiments we adopted the best learning rates for the baseline models ANT and MTL rather than for the latent-optimized models. On average, LOANT outperforms ANT by 3.42% in F-score and MTL+LO outperforms MTL by 2.63%, which clearly demonstrates the benefits provided by latent optimization.
Latent Space vs. Model Parameter Space. In the ANT+MAML baseline, we adopt a MAML-like optimization strategy, which performs the look-ahead in the BERT parameter space instead of the latent representation space. Interestingly, this strategy does not provide much improvement and on average performs 1.40% worse than ANT. LOANT clearly outperforms ANT+MAML.
In addition, optimization in the latent space also provides savings in computational time and space. Table 3 shows the time and memory consumption of the different transfer learning methods. Adding LO to ANT has minimal effect on memory usage, but adding MAML nearly doubles the memory consumption. On average, ANT+MAML takes 3.1 times the running time of LOANT.
The Influence of Domain Divergence. In transfer learning, the test performance depends on the similarity between the domains. We thus investigate the dissimilarity between datasets using the Kullback-Leibler (KL) divergence between their unigram probability distributions,

    D_KL(P_s ‖ P_t) = Σ_{g ∈ V} P_s(g) log ( P_s(g) / P_t(g) ),

where P_s(g) and P_t(g) are the probabilities of unigram g in the source domain and the target domain respectively, and V is the vocabulary. Table 4 shows the results. Ptáček is more similar to the two target datasets than Ghosh. Of the two target datasets, iSarcasm is more similar to Ptáček than SemEval-18 is. Comparing LOANT and ANT, we observe that the largest improvement, 7.85%, occurs in the Ptáček → iSarcasm transfer, where the domain divergence is the smallest. The Ptáček → SemEval-18 transfer comes in second with 3.54%. Transferring from Ghosh yields smaller improvements. Further, we observe the same trend in the comparison between MTL+LO and MTL. The largest improvement brought by LO is 6.12%, in the Ptáček → iSarcasm transfer. As one may expect, applying LO leads to greater performance gains when the two domains are more similar.
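The unigram KL computation can be sketched as follows. The add-alpha smoothing is our assumption, introduced to keep the divergence finite for words that appear in only one corpus; the paper does not specify its smoothing scheme.

```python
import math
from collections import Counter

def unigram_kl(source_texts, target_texts, alpha=1.0):
    """KL(P_s || P_t) over unigrams with add-alpha smoothing (an assumption)."""
    c_s = Counter(w for t in source_texts for w in t.split())
    c_t = Counter(w for t in target_texts for w in t.split())
    vocab = set(c_s) | set(c_t)
    n_s = sum(c_s.values()) + alpha * len(vocab)
    n_t = sum(c_t.values()) + alpha * len(vocab)
    kl = 0.0
    for g in vocab:
        p_s = (c_s[g] + alpha) / n_s   # smoothed source probability
        p_t = (c_t[g] + alpha) / n_t   # smoothed target probability
        kl += p_s * math.log(p_s / p_t)
    return kl

# Identical corpora give zero divergence; differing corpora give a positive one.
divergence = unigram_kl(["the cat sat"], ["the dog sat"])
```

Since KL divergence is asymmetric, the direction matters; here it is measured from the source distribution toward the target distribution.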

Conclusion
Transfer learning holds the promise for the effective utilization of multiple datasets for sarcasm detection. In this paper, we propose a latent optimization (LO) strategy for adversarial transfer learning for sarcasm detection. By providing look-ahead in the gradient updates, the LO technique allows multiple losses to accommodate each other. This proves to be particularly effective in adversarial transfer learning where the domain-specific losses and the adversarial loss potentially conflict with one another. With the proposed LOANT method, we set a new state of the art for the iSarcasm dataset. We hope the joint utilization of multiple datasets will contribute to the creation of contextualized semantic understanding that is necessary for successful sarcasm detection.
A Gradient Computation

By the same reasoning above, the total derivative of L_LO with respect to w_b is

    dL_LO/dw_b ≈ ∂L_LO/∂w_b + (∂L_LO/∂z_s')(∂z_s/∂w_b) + (∂L_LO/∂z_t')(∂z_t/∂w_b).

For the rest of the parameters, the computation is slightly different, as they do not contribute to z_s and z_t.
The parameter of the domain discriminator, θ_d, is updated to minimize L_d(z_s, z_t). This is in contrast to the rest of the model, which minimizes −L_d(z_s, z_t). The update rule for θ_d is

    θ_d ← θ_d − η ∂L_d(z_s, z_t)/∂θ_d.

B Hyperparameters and Model Initialization
We set the batch size to 128 for all models and search for the optimal learning rate (LR) from 2e-5 to 1e-4 in increments of 2e-5 using the F-score on the development set. We show the best learning rates found in Table 5. The best learning rate for fine-tuning BERT on SemEval-18 and iSarcasm is 4e-5. The S-BERT model is finetuned twice, first on the source domain and then on the target domain. Thus, we search for one best learning rate for each finetuning stage using the source and target development sets respectively. The best first-round LR is 6e-5 for Ptáček and 8e-5 for Ghosh.
The other models, MTL, ANT, and their LO-adapted versions, are selected using the target development set. For a rigorous comparison, we use the best LR for ANT when training LOANT and the best LR for MTL when training MTL+LO.
We follow the released code to implement the Gradient Reversal Layer. It is controlled by a schedule which gradually increases the weight of the gradients from the domain discrimination loss.

C Source Domain Performance
The original goal of the paper is to use automatically collected sarcasm datasets, which are large but noisy, to improve performance on human-annotated datasets, which are clean and provide a good performance measure. That is why we provided only the target domain performance.
Upon close inspection, LOANT also improves performance on the source domain, even though model selection was performed on the target domain. Table 6 shows the results. In Table 7, we also show the results after model selection on both domains. Naturally, this might lead to slightly lower target-domain performance than that achieved by model selection on the target domain only. Comparing LOANT with ANT, and MTL+LO with MTL, our results show that, in most cases, LO-based models improve both source and target domain F1. In particular, target domain F1 improves more than source domain F1. This suggests that LO provides benefits to knowledge transfer.