WIND: Weighting Instances Differentially for Model-Agnostic Domain Adaptation

Domain adaptation is a fundamental problem in machine learning and natural language processing. In this paper, we study the domain adaptation problem from the perspective of instance weighting. Conventional instance weighting approaches cannot learn weights that make the model generalize well in the target domain. To tackle this problem, inspired by meta-learning, we formulate domain adaptation as a bi-level optimization problem and propose a novel differentiable, model-agnostic instance weighting algorithm. Our approach automatically learns the instance weights instead of relying on manually designed weighting metrics. To reduce the computational complexity, we adopt a second-order approximation technique during training. Experimental results on three different NLP tasks (Sentiment Classification, Neural Machine Translation and Relation Extraction) illustrate the efficacy of our proposed method.


Introduction
Domain shift is a challenging problem commonly encountered in Natural Language Processing (NLP). Due to the data distribution discrepancy between the source and target domains, a model trained on data from the source domain may fail to achieve satisfying performance in the target domain; this is the domain adaptation problem. In some real-world situations, we only care about the performance of our model on a specific domain. To maintain performance, we need labeled training data for supervised learning. However, we often cannot collect enough labeled training data relevant to the domain we are interested in (in-domain). Thus, we need to introduce more labeled data from other domains (out-of-domain). We aim to leverage the general knowledge in the out-of-domain dataset to enhance the in-domain performance of our model.
In this work, we consider a specific domain adaptation scenario where we have a small amount of labeled in-domain training data and, at the same time, sufficient labeled out-of-domain training data from other general domains.
Training on these two datasets jointly is a straightforward solution for this scenario, but not all samples from the out-of-domain dataset have an equal effect during the training procedure. Several studies (Koehn and Knowles, 2017) on the neural machine translation (NMT) task show that out-of-domain instances relevant to the in-domain data are beneficial, while instances irrelevant to the in-domain data may even be harmful to translation quality. Likewise, for the sentiment classification task, some general expressions such as "I'm truly impressed by the design." may appear in all domains. Taking them as training samples can help the model learn general syntactic and semantic knowledge, which improves cross-domain sentiment classification performance. But using examples like "This chair is solid." (negative sentiment, furniture domain) may reduce the accuracy of classifying "This knife is solid." (positive sentiment, kitchen domain), because "solid" has different meanings in these two domains. Any such domain-specific expression would probably introduce noise. It is therefore essential to find a suitable strategy to measure the importance of each training sample.
There are many instance weighting (or instance selection) methods that tackle this problem. They assign a weight to each instance and transform the loss function into a weighted-sum formula. Most of the conventional methods (Jiang and Zhai, 2007; Gretton et al., 2006, 2009; Axelrod et al., 2011; Wang et al., 2017; Zhang and Xiong, 2018; Wang et al., 2019; Dou et al., 2020) propose different kinds of manually designed metrics to calculate the weights of instances. The core idea of these methods is to weight the instances according to their importance and similarity to the target domain. However, in our domain adaptation setting, the out-of-domain corpus is much larger than the in-domain corpus, so the weights learned by previous methods may be biased toward the out-of-domain data, which unavoidably results in poorer performance on the in-domain data. In this paper, we seek to automatically learn weights that make the model generalize well on the unbiased in-domain data.
Inspired by Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017), we introduce another unbiased subset of the in-domain data which serves as a query set. We propose a novel model-agnostic differentiable instance weighting approach named "WIND" (Weighting INstances Differentially), a general framework that can be applied to any task in our domain adaptation setting. Moreover, we dispense with manually designed metrics and make the weights differentiable. To reduce the computational complexity, we adopt a second-order derivative approximation for calculating the gradient of the weights. We conduct extensive experiments on datasets from three representative NLP tasks: sentiment classification, machine translation and relation extraction. The results show that our proposed method substantially outperforms several strong baselines.
The contributions of our work can be summarized as follows: • We propose a novel differentiable instance weighting algorithm for domain adaptation, which learns the weights of instances with gradient descent and does not need manually designed weighting metrics.
• We adopt a second-order approximation technique to speed up the model training.
• We conduct experiments on three typical NLP tasks: Sentiment Classification, Machine Translation and Relation Extraction. The experimental results demonstrate the effectiveness of the proposed method. Our code will be released.

Methodology
In this section, we first formulate our domain adaptation problem and introduce some notation. Then we present the proposed gradient-based model-agnostic instance weighting framework for our setting and introduce the method for approximating the second-order derivative of the query loss. Finally, we discuss some optimization details of our method.

Problem Formulation
Let D_train, D_dev and D_test denote our training, development and test datasets respectively. We use D_train for model training, D_dev for hyperparameter tuning and D_test for model testing. Both D_dev and D_test are in-domain data. Differently, D_train consists of sufficient labeled out-of-domain training samples $D_{out} = \{(x_i, y_i)\}_{i=1}^{m}$ together with a small labeled in-domain training set $D_{in} = \{(x_i, y_i)\}_{i=1}^{n}$. How to efficiently utilize D_in is the key to better domain transfer. To tackle this problem, in this paper we first sample an in-domain training subset $D_{it} = \{(x_i, y_i)\}_{i=1}^{n_1}$ from D_in, and we assign a scalar weight w_i to each instance (x_i, y_i) ∈ D_it ∪ D_out. We hope that during training the model can find the optimal weights $w = (w_1, \ldots, w_{n_1+m})$ by itself. For this purpose, the weights w should be differentiable and optimizable by gradient descent. Moreover, we denote the deep neural network (DNN) as a function $f_\theta: \mathcal{X} \rightarrow \mathcal{Y}$, parameterized by θ, which maps x_i from the input space to the label space. In our instance weighting setting, the training loss follows a weighted-sum formula:

$$L_{train}(\theta, w) = \sum_{i=1}^{n_1+m} w_i \, \ell(f_\theta(x_i), y_i), \qquad (1)$$

where ℓ denotes the loss function, which can be any kind of loss such as the cross-entropy loss for classification tasks, or the label-smoothed cross-entropy loss for machine translation.
Jointly optimizing θ and w using Eq. 1 is a straightforward solution. However, due to the data distribution discrepancy between the in-domain and out-of-domain datasets, learning w directly from D_it ∪ D_out by Eq. 1 may introduce bias. What we expect is that the model trained with w generalizes to the in-domain data. To achieve this goal, inspired by MAML (Finn et al., 2017), we propose to sample another subset $D_q = \{(x_i, y_i)\}_{i=1}^{n_2}$, named the query set, from D_in, and to use this query set to optimize w. Specifically, we aim to obtain a weight vector w which minimizes the loss on D_q:

$$L_q(\theta) = \sum_{i=1}^{n_2} \ell(f_\theta(x_i), y_i). \qquad (2)$$

Note that we only weight the n_1 + m instances from D_it ∪ D_out, so L_q(θ) has a standard form, not a weighted-sum form. Given a specific w, we can train a model with the loss L_train(θ, w) and obtain the optimized parameters θ*. We aim to minimize the loss on the query set given θ*. Therefore, our problem can be formulated as the following bilevel optimization problem (Colson et al., 2007):

$$\min_{w} \; L_q(\theta^*(w)) \quad \text{s.t.} \quad \theta^*(w) = \arg\min_{\theta} L_{train}(\theta, w). \qquad (3)$$

This bilevel formulation arises in many meta-learning or hyperparameter optimization (HPO) problems (Bergstra et al., 2013; Franceschi et al., 2018), where the optimization of the outer objective L_q depends on the optimization of the inner objective L_train. In fact, Eq. 3 is a special case of hyperparameter optimization, because w can be viewed as a special hyperparameter of our model. In Section 2.2, we introduce our proposed algorithm for solving this nested formulation.
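For concreteness, the following is a minimal PyTorch sketch of the weighted-sum training loss in Eq. 1 for a classification task. It is an illustration under stated assumptions, not our released implementation; the helper name and mini-batch tensors are hypothetical, and later sketches in this section reuse weighted_train_loss.

```python
import torch
import torch.nn.functional as F

def weighted_train_loss(model, x, y, w_batch):
    """Weighted-sum training loss of Eq. 1 for one mini-batch:
    sum_i w_i * loss(f_theta(x_i), y_i)."""
    logits = model(x)                                         # f_theta(x_i)
    per_example = F.cross_entropy(logits, y, reduction="none")
    return (w_batch * per_example).sum()                      # weighted sum
```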

Optimization of Instance Weights
It is difficult to directly solve the above bilevel optimization problem because of the high complexity of solving the inner objective. There are many gradient-based methods (Maclaurin et al., 2015; Franceschi et al., 2018) for this kind of problem. However, unlike typical hyperparameters such as the learning rate, the instance weight vector w is high-dimensional, which makes the problem even harder to optimize in our setting.
Inspired by the optimization techniques used in model-agnostic meta-learning (MAML) (Finn et al., 2017), we split the training procedure of each iteration into the following three steps.

Pseudo Update
First, we sample two mini-batches of data from D_it ∪ D_out and D_q respectively. Then we compute the model's parameters after a one-step update with the gradient of L_train(θ, w) with respect to θ:

$$\tilde{\theta} = \theta - \beta \nabla_{\theta} L_{train}(\theta, w), \qquad (4)$$

where β denotes the learning rate of this step.
This step is only a "pseudo update". After updating, we do not replace the original parameters θ with the adapted parameters $\tilde{\theta}$. Instead, we store both θ and $\tilde{\theta}$. We will use $\tilde{\theta}$ to calculate the gradient of w in the second step. So in our proposed algorithm, $\tilde{\theta}$ is just an intermediate variable which is discarded at the end of the current iteration.
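A sketch of this step, reusing the hypothetical weighted_train_loss helper from the earlier sketch; raw_w, batch_idx and beta are illustrative assumptions (see the later subsections on scaling and initialization for how the raw weights are stored and scaled):

```python
import torch

# Pseudo update (Eq. 4): one gradient step on the weighted training loss.
# theta_tilde is stored alongside theta; the model's parameters are untouched.
w_batch = torch.sigmoid(raw_w[batch_idx])            # scaled instance weights
train_loss = weighted_train_loss(model, x_t, y_t, w_batch)
grads = torch.autograd.grad(train_loss, list(model.parameters()))
theta_tilde = [p.detach() - beta * g for p, g in zip(model.parameters(), grads)]
```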

Instance Weight Update
Then we update the instance weights w using $\tilde{\theta}$. In this step, our goal is to find an optimal w*. We expect w* to have the following property: optimizing one step with L_train(θ, w*) should result in a decrease of the query loss. In other words, we expect w* to minimize the loss on the query set after the one-step update:

$$w^* = \arg\min_{w} L_q\big(\theta - \beta \nabla_{\theta} L_{train}(\theta, w)\big). \qquad (5)$$

Note that this is an approximation of the outer objective of Eq. 3. Theoretically, we could perform gradient descent for many steps to find w*, but this is time-consuming. So we optimize w with the gradient of $L_q(\tilde{\theta})$ with respect to w for only one step:

$$\tilde{w} = w - \gamma \nabla_{w} L_q(\tilde{\theta}), \qquad (6)$$

where γ denotes the learning rate of w.
We take $\tilde{w}$ as an approximation of w*. Using multiple gradient updates for w is a straightforward extension of this step, which would lead to a more accurate approximation of w* while increasing the computational complexity at the same time.
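To make the cost of Eq. 6 explicit, here is a sketch of the exact one-step update, which differentiates the query loss through the pseudo update. It assumes PyTorch 2.x's torch.func.functional_call and reuses the hypothetical weighted_train_loss helper; the second-order graph built by create_graph=True is exactly the cost that the approximation in the next subsection removes.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

# Re-run the pseudo update, keeping the graph so that theta_tilde
# remains a differentiable function of the raw weights.
w_batch = torch.sigmoid(raw_w[batch_idx])
train_loss = weighted_train_loss(model, x_t, y_t, w_batch)
grads = torch.autograd.grad(train_loss, list(model.parameters()),
                            create_graph=True)
theta_tilde = {name: p - beta * g
               for (name, p), g in zip(model.named_parameters(), grads)}

# Query loss at theta_tilde, then one exact gradient step on w (Eq. 6).
q_logits = functional_call(model, theta_tilde, (x_q,))
q_loss = F.cross_entropy(q_logits, y_q)
(w_grad,) = torch.autograd.grad(q_loss, [raw_w])
with torch.no_grad():
    raw_w -= gamma * w_grad
```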

Final Update
From the previous two steps, we obtain approximately optimal weights $\tilde{w}$. We use them for the actual update of θ:

$$\theta \leftarrow \theta - \beta \nabla_{\theta} L_{train}(\theta, \tilde{w}). \qquad (7)$$

The current iteration ends after this step. As mentioned before, $\tilde{\theta}$ is discarded, but we can choose whether or not to discard $\tilde{w}$. This is further discussed in Section 2.4.3.
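In code, this step is an ordinary optimizer step on the weighted loss with the freshly updated weights, sketched below under the same illustrative assumptions (optimizer is any standard PyTorch optimizer over the model parameters):

```python
import torch

# Final update (Eq. 7): the weights are detached so this step only
# trains theta; theta_tilde from the pseudo update is discarded.
w_batch = torch.sigmoid(raw_w[batch_idx]).detach()
optimizer.zero_grad()
weighted_train_loss(model, x_t, y_t, w_batch).backward()
optimizer.step()
```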

Second-Order Derivative Approximation
There is a serious computational problem when calculating the gradient $\nabla_w L_q(\tilde{\theta})$ in the instance weight update (Section 2.2.2). Applying the chain rule to Eq. 6 gives:

$$\nabla_{w} L_q(\tilde{\theta}) = -\beta \big(\nabla^2_{\theta, w} L_{train}(\theta, w)\big)^{\top} \nabla_{\tilde{\theta}} L_q(\tilde{\theta}). \qquad (8)$$

We use |θ| and |w| to denote the dimensions of θ and w respectively. The second-order derivative $\nabla^2_{\theta, w} L_{train}$ is a |θ| × |w| matrix, which is too large to calculate and store; computing the matrix-vector product is also expensive. Calculating the result exactly is therefore unrealistic. Fortunately, we can adopt the approximation technique used in DARTS (Liu et al., 2018) to solve this problem. This technique uses the finite difference approximation:

$$\big(\nabla^2_{\theta, w} L_{train}(\theta, w)\big)^{\top} \nabla_{\tilde{\theta}} L_q(\tilde{\theta}) \approx \frac{\nabla_{w} L_{train}(\theta^{+}, w) - \nabla_{w} L_{train}(\theta^{-}, w)}{2\epsilon}, \qquad (9)$$

where $\theta^{\pm} = \theta \pm \epsilon \nabla_{\tilde{\theta}} L_q(\tilde{\theta})$ and ε is a small scalar. We follow Liu et al. (2018) and set $\epsilon = 0.01 / \|\nabla_{\tilde{\theta}} L_q(\tilde{\theta})\|_2$, which is accurate enough for the approximation. Letting α = βγ, we can adjust the learning rate of w by tuning α.
Calculating this approximated gradient requires only two additional forward-backward passes on the training mini-batch, for θ+ and θ−, which greatly accelerates the training procedure. More details about the training process are described in Algorithm 1.
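The following sketch implements the finite difference approximation of Eq. 9 under the same illustrative assumptions as the previous snippets (query_grads holds $\nabla_{\tilde{\theta}} L_q(\tilde{\theta})$ from a single backward pass at $\tilde{\theta}$, and weighted_train_loss is the hypothetical helper defined earlier):

```python
import torch

def finite_diff_w_grad(model, raw_w, batch_idx, x_t, y_t, query_grads, beta):
    """Approximate grad_w L_q(theta_tilde) via Eq. 8 and Eq. 9."""
    # eps = 0.01 / ||grad_theta_tilde L_q||_2, following Liu et al. (2018).
    eps = 0.01 / torch.cat([g.flatten() for g in query_grads]).norm()

    def grad_w_at(sign):
        with torch.no_grad():                  # theta <- theta +/- eps * g
            for p, g in zip(model.parameters(), query_grads):
                p.add_(sign * eps * g)
        w_batch = torch.sigmoid(raw_w[batch_idx])
        loss = weighted_train_loss(model, x_t, y_t, w_batch)
        (gw,) = torch.autograd.grad(loss, [raw_w])
        with torch.no_grad():                  # restore the original theta
            for p, g in zip(model.parameters(), query_grads):
                p.sub_(sign * eps * g)
        return gw

    # The chain rule (Eq. 8) contributes the leading factor -beta.
    return -beta * (grad_w_at(+1.0) - grad_w_at(-1.0)) / (2 * eps)
```

The returned gradient is then plugged into the update of Eq. 6, i.e., the stored weights are decreased by γ times this value, so the effective step size is α = βγ.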

Dataset Split Strategy
The data split of the query set D_q is critical. As mentioned in Section 2.1, we randomly sample D_it and D_q from the in-domain training set D_in. Given enough in-domain data, D_it and D_q should be disjoint. However, our in-domain training set is not large, and splitting it would make it even smaller. We therefore use D_it = D_q = D_in instead of sampling. The ablation studies on this issue are shown in Section 3.5.

Scaling the Weights
In this work, an extreme value of w_i may make the training unstable, so it is important to scale each weight to an appropriate range. In practice, we use the sigmoid function to map each raw weight into (0, 1).

Initialization of Instance Weights
How to initialize w is an important issue. In this paper, we assume that all the training samples from the in-domain training set D_it are beneficial and should be highly weighted. For samples in D_it, we fix their raw weights to a very large number at the beginning of training, so that the scaled weight is close to 1 after applying the sigmoid function. For samples in D_out, we initialize all raw weights to zero. During training, we do not optimize the weights of the in-domain training samples and only update the weights of the out-of-domain training samples.
Moreover, when to initialize w is another important issue. We propose two different initialization strategies: one is to initialize w at the beginning of each iteration; the other is to initialize w once at the beginning of training and keep updating the stored w in every iteration. In practice, we choose the latter. Although the former is easier to implement, it cannot make use of the w learned in previous iterations.
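A sketch of this initialization and update scheme, with illustrative sizes (n1 in-domain and m out-of-domain instances are assumptions, not the actual corpus statistics):

```python
import torch

n1, m = 500, 500_000   # illustrative sizes of D_it and D_out

# Raw (pre-sigmoid) weights: a very large constant for in-domain samples,
# so sigmoid(raw_w) is ~1, and zeros for out-of-domain samples (~0.5).
raw_w = torch.cat([torch.full((n1,), 1e8), torch.zeros(m)])
raw_w.requires_grad_(True)

# In-domain weights stay fixed; only out-of-domain entries are updated.
out_mask = torch.cat([torch.zeros(n1), torch.ones(m)])

def apply_weight_update(w_grad, gamma):
    """Persistently update the stored raw_w (our chosen strategy)."""
    with torch.no_grad():
        raw_w.sub_(gamma * out_mask * w_grad)
```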

Experiments
To evaluate the effectiveness of our proposed method introduced in Section 2 and to demonstrate its model-agnostic property, we apply it to three different dataset settings for three tasks: Sentiment Classification, Machine Translation (MT) and Relation Extraction.

Datasets
For the sentiment classification task, we conduct experiments on the widely used Amazon Review Dataset (Blitzer et al., 2007). This dataset contains four domains: books (B), dvd (D), electronics (E) and kitchen (K), each containing reviews of a specific category of products. We use the data processed by He et al. (2018) and collect 6,000 labeled samples for each domain. We split the data of each domain into training (D_in), development (D_dev) and test (D_test) sets. In each domain adaptation setting, we choose the training data of one domain as the in-domain data (D_in) and all data of the other three domains as the out-of-domain data (D_out). Table 1 shows an example with books as the in-domain.

For the machine translation task, we use the IWSLT English-German corpus (Cettolo et al., 2016) as the in-domain data. This corpus contains about 202K sentences from TED talks. For the out-of-domain data, we randomly sample a subset of 500K sentences from the WMT 2014 English-German corpus. Table 2 shows the statistics of the datasets.
For the relation extraction task, we evaluate our method on the ACE 2005 dataset. This dataset is suitable for evaluating domain adaptation because it contains six different domains, and it has been adopted by many previous works (Nguyen and Grishman, 2014; Gormley et al., 2015; Fu et al., 2017) for cross-domain relation extraction. In this work, we take the broadcast news (bn) and newswire (nw) domains as out-of-domain data, and split the broadcast conversation (bc) domain into train/dev/test sets with a ratio of 1 : 1 : 4. Table 3 shows the detailed statistics.

Implementation Details
For the sentiment classification task, we use the pre-trained BERT-base-uncased (Devlin et al., 2018) model provided by HuggingFace (Wolf et al., 2019) as our feature extractor. Our sentiment classifier is a one-hidden-layer MLP with ReLU as the activation function. For the optimization of the model parameters θ, we use AdamW (Loshchilov and Hutter, 2018) with a learning rate of 2e-5, a warmup over the first 10% of the total steps and a linearly decayed learning rate scheduler. The computational cost is about 8-12 GPU hours on a Tesla V100.
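A sketch of this optimizer configuration using standard HuggingFace utilities; the number of training steps is an illustrative placeholder that depends on the dataset size and number of epochs:

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

num_training_steps = 10_000   # placeholder; set from len(loader) * epochs
optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # 10% warmup
    num_training_steps=num_training_steps,           # then linear decay
)
```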
For machine translation, we choose a vanilla Transformer (Vaswani et al., 2017) as our backbone. We implement the baseline methods and our method with the fairseq toolkit (Ott et al., 2019). We use the MOSES scripts to tokenize the English and German sentences, and then apply the Byte Pair Encoding (BPE) (Sennrich et al., 2015) algorithm to split words into subwords. We limit the maximum sentence length to 250 subwords. We share the embeddings of English and German with a vocabulary size of 32,000. We use Adam (Kingma and Ba, 2014) as the optimizer with a decayed learning rate starting from 7e-4.
For relation extraction, for simplicity we focus only on relation classification, where the entity pairs are given. We use the RBERT (Wu and He, 2019) model as our backbone. The configurations of the optimizer and learning rate are the same as those in our sentiment classification experiments.

Baselines
We implemented the following baseline methods for comparison with our method. It is worth noting that we do not include some early instance weighting baselines (Jiang and Zhai, 2007; Wang et al., 2017), because they are quite early work and it would be unfair to compare with them. For sentiment classification:
• In A pre-trained BERT only fine-tuned on the in-domain training set.
• Out A pre-trained BERT only fine-tuned on the out-of-domain training set.
• In+Out A pre-trained BERT fine-tuned on both in-domain and out-of-domain data.
• Ensemble It ensembles the In model and the Out model by adding their predictions. Note that this method is used as a baseline in Wang et al. (2017). Although Wang et al. (2017) conducted their experiments on machine translation, we can still adopt this method for the sentiment classification task.
• IW-Fit It uses the weighting strategy proposed by Wang et al. (2019) for domain transfer.
• DANN It introduces the domain classifier and adversarial training as proposed by Ganin et al. (2016).
For machine translation, the meanings of In, Out and In+Out are the same as in the sentiment classification setting. There are some additional baselines for the machine translation setting: • DM This indicates the Discriminative Mixing method proposed by Britz et al. (2017), which adds a domain classifier on top of the encodings of the source sentences, similar to DANN (Ganin et al., 2016).
• IDDA This indicates the Iterative Dual Domain Adaptation method proposed by Zeng et al. (2019), which iteratively performs bidirectional translation knowledge transfer between the in-domain and out-of-domain models using knowledge distillation. Note that this method targets the performance of both domains, but in this paper we only focus on in-domain performance.
For relation extraction, besides the In, Out and In+Out approaches, we also choose Fu et al. (2017) as a baseline. This method simply introduces DANN (Ganin et al., 2016) to cross-domain relation extraction. Since it was originally implemented with a convolutional neural network, we reimplement an RBERT (Wu and He, 2019) version of it.

Experiment Results

Table 4 shows the overall performance of our method in the domain adaptation setting on the sentiment classification task. Our method achieves absolute improvements of 0.45, 0.40, 0.52 and 0.88 points on the four settings respectively in comparison to the In+Out baseline. Moreover, our method outperforms all the domain adaptation methods on the settings with B, D and K as the in-domain data, the exception being the E domain. Although our method does not beat all baselines on all settings, it achieves the best average performance across the four settings. On average, we achieve an improvement of 0.28 points over DANN and 0.54 points over In+Out. Table 5 shows the performance on the machine translation task, measured by BLEU (Papineni et al., 2002) scores. Our method beats all baselines on both test sets (tst2013, tst2014) and the development set (tst2012). On these three datasets, we observe improvements of 1.17, 1.08 and 0.54 BLEU points compared to In+Out. On the tst2013 and tst2014 test sets, we also achieve improvements of 0.65 and 0.38 BLEU points compared to the IDDA (Zeng et al., 2019) method. Table 6 further shows our method's effectiveness on the relation extraction task.
Furthermore, from the results in Tables 4, 5 and 6, we can make the following observations: (1) Our differentiable weighting approach outperforms the method of Wang et al. (2019), which uses manually designed weighting metrics, on the sentiment classification task. This result demonstrates that manually designing metrics may not be the best solution for all tasks: it requires much prior expert knowledge, which is hard to generalize across tasks. By contrast, our method learns the instance weights with a meta-learning based algorithm, improving the model's in-domain generalization capability.
(2) The domain adversarial method is a strong baseline which is surpassed only by our method on the sentiment classification and relation extraction tasks. However, it does not perform as well on the machine translation task. A potential reason is that Britz et al. (2017) introduce the domain classifier after the encoder to learn domain-invariant features of the source sentences, while both domains share the same decoder, which cannot discriminate the features produced by the encoder. In other words, this type of method may only pay attention to the encoder and ignore the domain transfer of the decoder. In contrast, our method overcomes this problem by weighting the loss of the whole model and thus achieves better performance.
(3) For all three tasks, adding the out-of-domain corpus to the training set improves the overall performance. We believe that adding data from some general domains can help the model better learn domain-invariant syntactic and semantic knowledge, which improves the performance on the in-domain data. This is consistent with the conclusions reached in transfer learning. Interestingly, this result contradicts the observation of Wang et al. (2017), whose experiments show that adding out-of-domain data to in-domain data degraded machine translation performance. We suspect that there is a problem with their training strategy: the hyperparameters required under each setting may differ, and hyperparameters set inappropriately (e.g., kept the same as for In) may make the In+Out result even worse. Another reason may be that the RNN-based sequence-to-sequence NMT system they used tends to be more sensitive to noise, while the Transformer (Vaswani et al., 2017) model we use is more robust. All in all, as expected, our proposed method WIND achieves the best performance under all three task settings, which illustrates the advantage of a differentiable method for data weighting.

Table 7: Comparison between different variants of our method on the sentiment classification task. Note that the results in this table are evaluated under the |D_in| = 1,000 setting. "+rand" means randomly initializing w; "+split" means splitting D_in into disjoint D_it and D_q; "init" means not assigning a large number to the in-domain data weights.

Ablation Study
In this part, we study the effect of the strategies mentioned in Section 2.4.1 and Section 2.4.3. The experimental results shown in Table 7 demonstrate that: (1) Setting the weights of in-domain instances to a large number (1e8) and fixing them during training improves the accuracy. These weights do not actually need to be learned; fixing them may reduce the interference of the out-of-domain data with the learning process.
(2) Zero initialization of the weights of out-of-domain instances is better than random initialization. The underlying reason may be that random initialization can easily make the model get stuck in a local minimum.
(3) Not splitting D_in improves the performance as well. Intuitively, this improvement comes from the increased size of the in-domain training set, which lets us make more use of the scarce in-domain training samples.

Effect of In-Domain Dataset Size
In this part, we study the impact of the in-domain dataset size n. Besides the n = 500 setting, we sample another two different D_in with n = 100 and n = 1,000. The rest of the in-domain data is used as the development set. We evaluate all three settings on the same D_test mentioned in Table 1. Figure 1 shows the average accuracy over the four domain settings for different domain adaptation methods. We find that DANN (Ganin et al., 2016) may not perform well when in-domain data are scarce, while our method still achieves consistent improvements in all three dataset-size settings.

Related Work

Domain Adaptation
Domain adaptation is a fundamental problem in machine learning and NLP. The aim is to train a well-performing model on a source domain that generalizes to a target domain.
The basic idea of domain adaptation is to learn domain-invariant representations which generalize across domains. To achieve this, the most prevailing method, the Domain Adversarial Neural Network (DANN) (Ganin et al., 2016; Qu et al., 2019; Xue et al., 2020), introduces a domain classifier and uses adversarial training to make the learned features indistinguishable between the source and target domains. This method has been applied to many NLP tasks. However, in our setting the out-of-domain data far outnumber the in-domain data, and DANN may introduce bias on such an unbalanced dataset. Another line of methods (Fang and Xie, 2020; Li et al., 2020) proposes to learn domain-general representations by contrastive learning (Chen et al., 2020a,b). However, they mainly focus on classification tasks and are not model-agnostic frameworks.

Cross-Domain Sentiment Classification
The sentiment classification task aims to automatically classify the sentiment polarity of a given text. Cross-domain sentiment classification aims to generalize the sentiment classifier from a source domain to a target domain.
Besides the domain adaptation methods introduced in Section 4.1, there are some methods specific to cross-domain sentiment classification. An important line of work follows Structural Correspondence Learning (SCL) (Blitzer et al., 2006) and designs an auxiliary task called pivot prediction to transfer domain-invariant knowledge (Pan et al., 2010; Yu and Jiang, 2016; Ziser and Reichart, 2016). However, the pivot words are selected with human knowledge, which may be inaccurate. Recently, pretrained language models such as BERT (Devlin et al., 2018) have achieved state-of-the-art results on many NLP tasks. DAAT (Du et al., 2020) performs a novel post-training procedure on BERT and uses adversarial training to transfer domain knowledge. But this method only works for classification tasks, whereas our method is model-agnostic and does not need two-stage post-training and fine-tuning.

Meta-Learning
The goal of meta-learning is to train a model that can adapt to a new task quickly given a few new samples. In other words, meta-learning can learn an initial value of the model that is close to the optima of many different tasks. MAML (Finn et al., 2017) is a classical meta-learning method. Each entry of the meta-training set of MAML is a subset that contains training data (the support set) and test data (the query set). MAML calculates the loss on the query set with the parameters obtained after a one-step optimization on the support set, and uses the gradient of this loss to update the model parameters. MAML has also been adopted for natural language understanding tasks (Dou et al., 2019). Although our domain adaptation setting is quite different from that of MAML, we can still utilize the idea of their work to help domain generalization.

Conclusion
In this paper, we propose WIND, a differentiable instance weighting method for model-agnostic domain adaptation, which draws on ideas from meta-learning to learn instance weights on an in-domain query set. Experimental results on three typical NLP tasks show the efficacy of our framework.
It remains an open question how to efficiently transfer the domain knowledge. In the future, we plan to evaluate our method on more different tasks.