To Share or not to Share: Predicting Sets of Sources for Model Transfer Learning

In low-resource settings, model transfer can help to overcome a lack of labeled data for many tasks and domains. However, predicting useful transfer sources is a challenging problem, as even the most similar sources might lead to unexpected negative transfer results. Thus, ranking methods based on task and text similarity — as suggested in prior work — may not be sufficient to identify promising sources. To tackle this problem, we propose a new approach to automatically determine which and how many sources should be exploited. For this, we study the effects of model transfer on sequence labeling across various domains and tasks and show that our methods based on model similarity and support vector machines are able to predict promising sources, resulting in performance increases of up to 24 F1 points.


Introduction
For many natural language processing applications in non-standard domains, only limited labeled data is available. This even holds for high-resource languages like English (Klie et al., 2020). The most popular method to overcome this lack of supervision is transfer learning from high-resource tasks or domains. This includes the usage of resources from similar domains (Ruder and Plank, 2017), domain-specific pretraining on unlabeled texts (Gururangan et al., 2020), and the transfer of trained models to a new domain (Bingel and Søgaard, 2017). While the choice among different possible transfer sources can be advantageous, identifying the most valuable ones is challenging, as many sources may actually lead to negative transfer (Pruksachatkun et al., 2020).
Current methods to select transfer sources are based on text or task similarity measures (Dai et al., 2019; Schröder and Biemann, 2020). The underlying assumption is that similar texts and tasks can support each other. An example of similarity based on vocabulary overlap is shown in Figure 1. However, current methods typically consider text and task similarity in isolation, which limits their application in transfer settings where both the task and the text domain change.
Thus, as a first major contribution, this paper proposes a new model similarity measure based on mappings between two neural models. By considering text and task similarity jointly, it captures similarity between domain-specific models across tasks better than existing approaches. We perform experiments for different transfer settings, namely unsupervised model transfer, supervised domain adaptation and cross-task transfer, across a large set of source domains and tasks. Our newly proposed similarity measure successfully predicts the best transfer sources and outperforms existing text and task similarity measures.
As a second major contribution, we introduce a new method to automatically determine which and how many corpora should be used in the transfer process, as the transfer can benefit from multiple sources. This overcomes the limitations of current transfer methods, which solely predict single sources based on rankings. We show the benefits of transfer from sets of sources and demonstrate that support vector machines are able to predict the best sources across domains and tasks. This improves performance with absolute gains of up to 24 F1 points and effectively prevents negative transfer.

Related Work
Domain adaptation & transfer learning are typically performed by transferring information and knowledge from a high-resource to a low-resource dataset (Daumé III, 2007; Ruder, 2019). Recent approaches can be divided into two groups: (i) model transfer (Ruder and Plank, 2017) by reusing trained task-specific weights (Vu et al., 2020) or by first adapting models on the target domain before training the downstream task (Gururangan et al., 2020; Rietzler et al., 2020); (ii) multi-task training (Collobert and Weston, 2008) where multiple tasks are trained jointly by learning shared representations (Peng and Dredze, 2017; Meftah et al., 2020). We follow the first approach in this paper.
For transfer learning, the selection of sources is crucial. Corpus and model similarity measures (Ruder and Plank, 2017; Bingel and Søgaard, 2017) are used to select the best sources for cross-task transfer (Jiang et al., 2020), multi-task transfer (Schröder and Biemann, 2020), cross-lingual transfer (Chen et al., 2019) and language modeling (Dai et al., 2019). Alternatively, neural embeddings for corpora can be compared (Vu et al., 2020). In prior work, the set of domains is usually limited and the focus is on the single best source. In contrast, we exploit a larger set of source domains and also explore the prediction of sets of sources, as exploiting multiple sources is likely to be even more beneficial.

Terminology
We consider two dimensions of datasets: the task T, which defines the label set, and the domain of the documents D. We thus define a dataset as a tuple (T, D) and specify in our experiments which of the two dimensions are changed.

Similarity Measures
We apply the following measures to rank sources according to their similarity with the target data.
Baselines. We use the most promising domain similarity measures reported by Dai et al. (2019), namely Vocabulary and Annotation overlap, Language model perplexity (Baldwin et al., 2013), Datasize (Bingel and Søgaard, 2017) and Term distribution (Ruder and Plank, 2017). Neural methods can compute domain similarity using Text embeddings or task similarity using Task embeddings (Vu et al., 2020).
Model similarity. As a new strong method, we propose Model similarity, which is able to combine both domain and task similarity. For this, the feature vectors for the target dataset are extracted from the last layer of two trained models and aligned by a learned mapping between the feature spaces using the Procrustes method (Schönemann, 1966). The distance between the two models is the deviation of the mapping from the identity matrix I, where a small deviation indicates similarity. Similar mappings have been used for the alignment of different embedding spaces (Mikolov et al., 2013; Artetxe et al., 2018), since they inherently carry information on the relatedness between models.
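As an illustration, a minimal sketch of this measure, assuming the learned mapping is constrained to be orthogonal (the standard orthogonal Procrustes solution) and that the deviation from the identity is measured with the Frobenius norm; function and variable names are ours, not from the paper:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def model_distance(features_a, features_b):
    # features_a, features_b: (n_target_examples, hidden_dim) matrices of
    # last-layer feature vectors that two trained models produce for the
    # same target examples.
    w, _ = orthogonal_procrustes(features_a, features_b)  # learned mapping
    # A mapping close to the identity means the feature spaces are already
    # aligned, i.e., the models are similar; larger deviations indicate a
    # larger distance between the models.
    return np.linalg.norm(w - np.eye(w.shape[0]))
```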

Prediction Methods for Sets of Sources
While these similarity measures can be applied for creating rankings and selecting similar datasets, they still have a major shortcoming in practice: none of them provides explicit insight into when positive or negative transfer can be expected.
Typically, the most similar source is selected for training based on a given similarity measure. We call this method Top-1. This introduces only a low risk of selecting a negative transfer source, but it also cannot benefit from other positive transfer sources. Thus, we also test its extension to an arbitrary selection of the n best sources, denoted by Top-n. However, it is unclear how to choose n, and increasing n comes with the risk of including sources that lead to negative transfer.
As a solution, we propose a method that predicts whether positive transfer is likely for a given distance between datasets. We model this either as a 3-class classification task for positive, neutral and negative transfer, or as a regression task predicting the transfer gain. We introduce the neutral class to cope with small transfer gains |g| < t, with t being a predefined threshold (t = 0.5 in our experiments). We propose to use support vector machines (SVM) for classification (-C) and regression (-R) and compare them to k-nearest-neighbour classifiers (k-NN), logistic and linear regression. Given the model similarity between source and target, the models predict which kind of transfer can be expected. The predictions for a target and a set of sources can then be used to select the subset of sources with expected positive transfer (see the sketch below).

Experimental Setup
We perform experiments on 33 datasets for three tasks: named entity recognition (NER), part-of-speech tagging (POS), and temporal expression extraction (TIME), as well as cross-task transfer, including NER with different label sets. Details about the datasets, their domain and size are provided in the appendix. The metric for all experiments is micro F1. We use the difference in F1 to measure transfer effects and also report the transfer gain (Vu et al., 2020), i.e., the relative improvement of a transferred model compared to the single-task performance.
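To make the SVM-based predictors from the previous section concrete, a minimal sketch using scikit-learn; we assume a single precomputed model-similarity distance per source-target pair, and all names are ours:

```python
from sklearn.svm import SVC, SVR

T = 0.5  # threshold t on the transfer gain for the neutral class

def gain_to_class(gain):
    # 3-class target: +1 positive, 0 neutral (|g| < t), -1 negative transfer.
    return 1 if gain >= T else (-1 if gain <= -T else 0)

def fit_predictors(distances, gains):
    # distances: array of shape (n_pairs, 1) with source-target model
    # distances; gains: observed transfer gains for those pairs.
    svm_c = SVC().fit(distances, [gain_to_class(g) for g in gains])  # SVM-C
    svm_r = SVR().fit(distances, gains)                              # SVM-R
    return svm_c, svm_r

def select_sources(svm_c, sources, distances):
    # Keep every candidate source with predicted positive transfer; the
    # selected set may be empty, i.e., no transfer should be performed.
    labels = svm_c.predict(distances)
    return [s for s, y in zip(sources, labels) if y == 1]
```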
In Section 5.2, we rank sources according to their similarity to the target. These rankings are evaluated with two metrics, following Vu et al. (2020): (1) the average rank of the best-performing model in the predicted ranking, denoted by ρ, and (2) the normalized discounted cumulative gain (NDCG). The latter evaluates the complete ranking, while ρ only considers the best-performing element.
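A minimal sketch of these two metrics, using transfer gains as relevance scores; shifting the gains to be non-negative for NDCG is our simplification, not a detail from the paper:

```python
import numpy as np

def best_rank_rho(predicted_order, gains):
    # rho: rank of the best-performing source within the predicted ranking
    # (1 = the measure put the truly best source first).
    best_source = max(gains, key=gains.get)
    return predicted_order.index(best_source) + 1

def ndcg(predicted_order, gains):
    # NDCG over the complete ranking, with transfer gains as relevance.
    rel = np.array([gains[s] for s in predicted_order], dtype=float)
    rel -= min(rel.min(), 0.0)  # NDCG assumes non-negative relevance scores
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    ideal = np.sort(rel)[::-1]
    return (rel * discounts).sum() / (ideal * discounts).sum()
```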

Sequence Tagger Model
For sequence tagging, we follow Devlin et al. (2019) and use BERT-base-cased as the feature extractor and a linear mapping to the label space followed by a softmax as the classifier. Hyperparameters, training details and single-task performances are provided in the appendix.
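A minimal sketch of this tagger, assuming the Hugging Face transformers library; the class and parameter names are ours, not from the paper:

```python
import torch
from torch import nn
from transformers import AutoModel

class SequenceTagger(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # feature extractor
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Last-layer token features -> linear map to label space -> softmax.
        hidden = self.encoder(input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return torch.softmax(self.classifier(hidden), dim=-1)
```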

Transfer Settings
Unsupervised model transfer. We apply a model trained on a source dataset to a target with the same task but a different domain, without further training on the target: (T, D_source) → (T, D_target).

Supervised domain adaptation. A model trained on a source domain is adapted to a target domain by fine-tuning its weights on target training data: (T, D_source) → (T, D_target), fine-tuned on (T, D_target).

Cross-task transfer. As trained models cannot be directly applied to different tasks, we replace the classification layer with a randomly initialized layer for the target task and adapt the model to the new target task: (T_source, D_source) → (T_target, D_target). A minimal sketch of this head replacement is shown below.
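The sketch of the cross-task head replacement, reusing the hypothetical SequenceTagger from the sketch above:

```python
from torch import nn

def prepare_for_cross_task(tagger, num_target_labels):
    # Keep the trained encoder, but replace the classification layer with a
    # randomly initialized one matching the target task's label set; the
    # model is then fine-tuned on the target task's training data.
    tagger.classifier = nn.Linear(tagger.encoder.config.hidden_size,
                                  num_target_labels)
    return tagger
```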

Results
This section presents the results of the different transfer settings and analyzes how similarity measures can be used to predict transfer sources. Table 1 shows the observed performance gains compared to the single-task performance. For unsupervised model transfer, we observe severe performance drops. In addition to domain-specific challenges, this setting is impaired by differences in the underlying annotation schemes. Thus, unsupervised model transfer is only advisable when no labeled data is available in the target domain.

Transfer Performance
Supervised domain adaptation, i.e., adapting a model to the target domain, improves performance across all settings independent of the source domain. Table 1 shows that the average transfer gains are positive for all tasks and that the maximum transfer gain is 32.7 points for TIME.
The gains for cross-task transfer are smaller than for supervised domain adaptation. While we still observe some performance increases, the average transfer gains are negative for all tasks. This shows that adapting models from other tasks is likely to decrease performance. These results demonstrate the need for reliable similarity measures and for methods to predict the expected transfer gains given the source task and domain. We explore these in the following Sections 5.2 and 5.3.

Similarity-based Ranking
To evaluate the prospects of different sources for model transfer, we compute the pairwise distances between all datasets using the similarity measures presented in Section 3.2 and rank them accordingly.

Table 2: Ranking results for different similarity measures in the three transfer settings. The values displayed are the average rank of the best model (ρ) and the NDCG score (N) compared to the performance.

Table 2 shows that the text-based methods, vocabulary and annotation overlap, are best suited for in-task transfer, i.e., model transfer and domain adaptation, while our model similarity is most useful for cross-task transfer. This shows that task similarity alone is not the most decisive factor for predicting promising transfer sources; domain similarity is equally or even more important, in particular when more distant domains are considered. Our model similarity captures both properties and, as a result, outperforms the task embedding in the cross-task setting and performs comparably to the text-based methods in the in-task settings. It is the best similarity measure on average across all transfer settings according to the predicted rank of the top-performing source (ρ) and the best neural method according to the NDCG score.
In general, we find that selecting only the top source(s) based on a ranking from a distance measure, as done in current research, gives no information on whether to expect positive transfer. Thus, we now explore methods to automatically predict sets of promising sources.

Prediction of Sets of Sources
We use the methods introduced in Section 3.3 to predict the set of most promising sources. Then, we train a model on the combination of the selected sources and adapt it to the target. The results averaged across the different settings are visualized in Figure 2. While NER and TIME targets benefit from training on many sources, POS tagging targets gain the most from using only one or two of the most related source domains. We find that our methods based on SVMs are able to predict this behavior and assign fewer sources for POS targets and more sources for TIME and NER settings. In particular, for TIME settings, our methods SVM-C and SVM-R result in much higher transfer gains compared to the static ranking-based methods and other classifiers or regression models. For example, transferring multiple sources using our SVM classifier to the ACE-UN target increases performance from 60.5 F1 for single-task training to 84.5 F1 (+24.0), which is much higher than the 10.9 points gained when using the single best source or the 10.2 points when using all available sources. More information can be found in the appendix.

For the cross-task experiments in the lower part of Figure 2, we find that even the inclusion of the single best-ranked model results in a transfer loss of -0.9 points on average for TIME→NER. In this setting, our models correctly adapt to this new challenge and predict an empty set of sources, indicating that no transfer should be performed.

Conclusion
We explored different transfer settings across three sequence labeling tasks and various domains. Our new model similarity measure based on feature mappings outperforms currently used similarity measures, as it captures both task and domain similarity at the same time. We further addressed the automatic selection of sets of sources as well as the challenge of potential negative transfer by proposing a selection method based on support vector machines. Our method results in performance gains of up to 24 F1 points.

A Datasets
All datasets are listed in Table 3 with information on their domain and size with respect to the label set and the number of sentences in the training, development, and test splits. We take the last 20% of the training data as test data whenever no test set was provided and similarly 10% for the development split.

B Hyperparameters
We use the BERT-base-cased model with a linear layer on top. This model consists of 12 transformer encoder layers with 768 dimensions and has roughly 110M trainable parameters. Models are trained using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 2e-5. The training is performed for a maximum of 100 epochs. We apply early stopping after 5 epochs according to the F1-score on the development set. We use the same hyperparameters across all settings. The training of a single model takes between 5 minutes and 8 hours, depending on the dataset size, on a single Nvidia Tesla V100 GPU with 32GB VRAM. All our experiments are run on a carbon-neutral GPU cluster.

C Similarity Measures
This section provides a more detailed overview of the similarity measures introduced in Section 3.2.
Target vocabulary overlap is the percentage of unique words from the target corpus covered in the source corpus. In contrast to pure vocabulary overlap, this is an asymmetric measure. Annotation overlap is a special case where only annotated words are considered.
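A minimal sketch of this measure, assuming pre-tokenized corpora; the function name is ours:

```python
def target_vocab_overlap(source_tokens, target_tokens):
    # Fraction of unique target words that also occur in the source corpus.
    # Asymmetric: swapping source and target generally changes the score.
    source_vocab, target_vocab = set(source_tokens), set(target_tokens)
    return len(target_vocab & source_vocab) / len(target_vocab)
```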
We also experiment with the Language model perplexity (Baldwin et al., 2013) between two datasets. For this, a language model, in our case a 5-gram LM with Kneser-Ney smoothing (Heafield, 2011) as done by Dai et al. (2019), is trained for each source domain and tested on the target domain. The resulting perplexity indicates how similar the domains are, i.e., a lower perplexity indicates higher similarity between domains.
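A minimal sketch, assuming KenLM's Python bindings with an ARPA model trained offline on the source corpus; the file name is hypothetical:

```python
import kenlm  # Python bindings for KenLM (Heafield, 2011)

# 5-gram LM with Kneser-Ney smoothing, trained offline on the source
# domain, e.g.: lmplz -o 5 < source.txt > source.arpa
source_lm = kenlm.Model("source.arpa")

def domain_perplexity(target_sentences):
    # Average per-sentence perplexity of the source LM on the target domain;
    # lower perplexity suggests more similar domains.
    scores = [source_lm.perplexity(s) for s in target_sentences]
    return sum(scores) / len(scores)
```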
The Jensen-Shannon divergence (Ruder and Plank, 2017) compares the term distributions of two texts, i.e., probability distributions that, in contrast to plain vocabularies, also capture word frequencies. It is similar to vocabulary overlap in that it describes textual overlap, but it is based on probability distributions rather than sets of terms.
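A minimal sketch using SciPy, again assuming pre-tokenized corpora:

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import jensenshannon

def term_distribution_distance(source_tokens, target_tokens):
    # Term distributions over the joint vocabulary of both corpora.
    vocab = sorted(set(source_tokens) | set(target_tokens))
    src, tgt = Counter(source_tokens), Counter(target_tokens)
    p = np.array([src[w] for w in vocab], dtype=float)
    q = np.array([tgt[w] for w in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # SciPy returns the JS distance, i.e., the square root of the divergence.
    return jensenshannon(p, q, base=2)
```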
A Text embedding (Vu et al., 2020) can be computed by extracting the feature vectors of a BERT model. For this, the output of the last layer is averaged over all words in the dataset. The averaged vector can then be used as a representation of the text domain. The distance between two such vectors is computed using cosine similarity.
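A minimal sketch, assuming the Hugging Face transformers library; the sentence-level mean pooling is our simplification, and the exact pooling details may differ from the original implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")

@torch.no_grad()
def text_embedding(sentences):
    # Average the last-layer token representations over the whole dataset.
    pooled = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        hidden = encoder(**inputs).last_hidden_state.squeeze(0)  # (len, 768)
        pooled.append(hidden.mean(dim=0))
    return torch.stack(pooled).mean(dim=0)

def domain_similarity(source_sentences, target_sentences):
    a = text_embedding(source_sentences)
    b = text_embedding(target_sentences)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```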
The Task embedding (Vu et al., 2020) takes a labeled source dataset and computes a representation based on the Fisher information matrix, which captures the change of model parameters w.r.t. the computed loss. This method assumes that similar tasks require changes to similar parameters. We use the released code from Vu et al. (2020) to compute task embeddings from the different components of our BERT models and similarly use reciprocal rank fusion (Cormack et al., 2009) to combine these.
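A minimal sketch of reciprocal rank fusion with the commonly used constant k = 60, which is our assumption rather than a setting confirmed by the paper:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of source names per model component.
    # RRF score (Cormack et al., 2009): score(s) = sum_r 1 / (k + rank_r(s)).
    scores = {}
    for ranking in rankings:
        for rank, source in enumerate(ranking, start=1):
            scores[source] = scores.get(source, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```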

D Model Performance
We list the performance for all single-task models (Table 4), as well as the transfer models (Tables 8 to 25), in the following. We show the F1-score and the corresponding transfer gain, as described in Section 5 of the main paper. For POS tagging, precision, recall, F1-score and accuracy are identical due to the absence of a non-labeled class like 'O', which is ignored in NER evaluation.

E Analysis of Multi-Source Predictors
We show the predicted sources and the corresponding transfer gains for the multi-source prediction methods in Table 7.