Meta Distant Transfer Learning for Pre-trained Language Models

With the wide availability of Pre-trained Language Models (PLMs), multi-task fine-tuning across domains has been extensively applied. For tasks related to distant domains with different class label sets, PLMs may memorize non-transferable knowledge for the target domain and suffer from negative transfer. Inspired by meta-learning, we propose the Meta Distant Transfer Learning (Meta-DTL) framework to learn cross-task knowledge for PLM-based methods. Meta-DTL first employs task representation learning to mine implicit relations among multiple tasks and classes. Based on the results, it trains a PLM-based meta-learner to capture the transferable knowledge across tasks. Weighted maximum entropy regularizers are proposed to make the meta-learner more task-agnostic and unbiased. Finally, the meta-learner can be fine-tuned to fit each task with better parameter initialization. We evaluate Meta-DTL using both BERT and ALBERT on seven public datasets. Experimental results confirm the superiority of Meta-DTL, as it consistently outperforms strong baselines. We find that Meta-DTL is highly effective when very little data is available for the target task.


Introduction
Owing to the availability of Pre-trained Language Models (PLMs), the performance of various text classification tasks has been significantly improved. Notable PLMs include BERT (Devlin et al., 2019), ALBERT (Lan et al., 2020), XLNet (Yang et al., 2019), T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020) and many others. It can be safely concluded that PLM-based approaches achieve state-of-the-art results for a majority of text classification tasks.
Among these methods, a key procedure of PLMs is fine-tuning, which enables the parameters of PLMs to fit specific datasets. Hence, the performance of PLMs on a downstream task may be limited by the availability of the training set. As reported by several benchmarks such as GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a), some PLMs may not perform well on low-resource tasks. A popular solution in NLP is transfer learning (Zhuang et al., 2019; Alyafeai et al., 2020). PLMs can be fine-tuned over both source-domain and target-domain datasets by various multi-task training strategies (e.g., Arase and Tsujii, 2019). Unfortunately, several studies reveal that multi-task training of PLMs across domains does not always guarantee satisfactory results (Sun et al., 2019). As PLMs usually have a large parameter space and strong memorization power, learning from source-domain datasets may force PLMs to memorize non-transferable knowledge of source domains, leading to the negative transfer effect (Wang et al., 2019d). Besides, a large number of transfer learning algorithms address tasks across similar sub-domains, with the same set of class labels. 1 When there exist large domain gaps and class label differences, these transfer learning solutions are likely to fail. Consider a simple, motivating example in Figure 1. Despite the fact that the two datasets (SST-5 (Socher et al., 2013) and Amazon Reviews (Blitzer et al., 2007)) are diverse in domains and classification targets, they aim to solve similar review analysis tasks. It would be beneficial for the two task-specific models to learn from each other. A few methods address the distant domain issue in transfer learning (Tan et al., 2017; Xiao and Zhang, 2020), but they are not designed for PLMs. A natural question arises: how can we transfer knowledge across distant domains with different classification targets for PLM-based text classification?
Recently, meta-learning has been studied extensively; it learns parameters that can be quickly adapted to a group of similar tasks (Finn et al., 2017, 2018). For PLMs, prior work suggests that training a meta-learner is highly effective for capturing transferable knowledge across sub-domains. However, this method is not designed for tasks across diverse domains and class label sets. Additionally, it lacks a mechanism to learn task-agnostic representations and may overfit to specific targets in certain datasets. To this end, we propose the Meta Distant Transfer Learning (Meta-DTL) framework 2 . Specifically, Meta-DTL employs a task representation learning procedure to obtain a collection of prototype vectors for each task. To understand how to transfer across these tasks and classes, we construct a Meta Knowledge Graph (Meta-KG) to characterize the implicit relations among tasks and classes, based on the representations of multiple tasks. The meta-learner in Meta-DTL can be initialized by any PLM and trained by multi-task learning with rich meta-knowledge injected from the Meta-KG. Additionally, we design the Weighted Maximum Entropy Regularizers to make the model more task-agnostic and unbiased. Finally, the meta-learner can be fine-tuned to fit each task using its own training set. In this way, the model is able to digest cross-task, transferable knowledge and alleviate negative transfer.
We apply Meta-DTL to BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) for three sets of NLP tasks (seven public datasets in total): i) coarse- and fine-grained review analysis across domains; ii) natural language inference (across sentence relation prediction and scientific question answering); and iii) lexical semantics (across hypernymy detection and lexical relation classification). Experiments show that Meta-DTL consistently outperforms strong baselines. We also show that Meta-DTL is highly useful for text classification when very few training samples of the target task are available.

2 We name our algorithm Meta Distant Transfer Learning because it is inspired by the idea of meta-learning to capture cross-task knowledge for task adaptation. We would like to clarify that Meta-DTL is used in a distant transfer learning setting for PLMs, instead of the traditional K-way N-shot setting in meta-learning (Finn et al., 2017, 2018).

Related Work
In this section, we summarize the related work on PLMs, transfer learning and meta-learning.

Pre-trained Language Models
PLMs have brought NLP to a new era, pushing the performance of various NLP tasks to new heights. Among these models, ELMo (Peters et al., 2018) employs BiLSTMs to learn context-sensitive embeddings from both directions. BERT (Devlin et al., 2019) is one of the most popular PLMs and learns language representations with transformer encoders. ALBERT (Lan et al., 2020) employs several parameter sharing and factorization techniques to reduce the sizes of BERT-style models. Other transformer encoder-based architectures include Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019), Big Bird (Zaheer et al., 2020), etc. The encoder-decoder architecture is used in T5 (Raffel et al., 2020), while GPT-3 (Brown et al., 2020) adopts a decoder-only transformer; these are ultra-large PLMs with 11 billion and 175 billion parameters, respectively. PLMs can also be pre-trained with supervised tasks, such as MT-DNN (Liu et al., 2019). Apart from pre-training PLMs, a few works focus on fine-tuning, such as Sun et al. (2019), Cui et al. (2019) and Zhao and Bethard (2020). Different from these works, we pay attention to transferring knowledge across distant domains for PLMs.

Transfer Learning
Transfer learning is a widely used paradigm for transferring resources from source domains to target domains (Pan and Yang, 2010; Lu et al., 2015; Zhuang et al., 2019; Wang et al., 2019c). For deep neural networks, it is common practice to learn similar tasks by multi-task learning. Among these methods, "shared-private" architectures are frequently applied (Liu et al., 2017; Chen et al., 2018), consisting of task-specific sub-networks and a shared sub-network. Meta-DTL transfers knowledge across tasks from a different perspective: it captures transferable knowledge with the meta-learner and passes the knowledge to task-specific models by fine-tuning.

Meta-learning
Meta-learning aims to train meta-learners that can quickly adapt to different tasks with little training data (Vanschoren, 2018). Typically, meta-learning is applied to few-shot learning as a K-way N-shot problem, such as few-shot image classification in computer vision (e.g., Afrasiyabi et al., 2020). In NLP, applications that employ meta-learning include few-shot link prediction in knowledge graphs, relation classification (Ye and Ling, 2019; Wang et al., 2021b), natural language generation, question answering (Hua et al., 2020), named entity recognition (Yang and Katiyar, 2020), text classification (Bao et al., 2020; Wang et al., 2021a), etc. In contrast, Meta-DTL is not a typical K-way N-shot algorithm. Similar to Pan et al. (2021), it leverages the techniques of meta-learning to obtain a meta-learner that is better at learning knowledge across tasks.

Meta-DTL: The Proposed Framework
In this section, we first present our task. After that, the technical details of the Meta-DTL framework are elaborated.

Task Overview
Let T 1 , · · · , T K be K text classification tasks, with the corresponding training sets denoted as D 1 , · · · , D K . In the single-task setting, the goal of the task T i is to learn a model from D i that maps each input instance to one of the class labels in C i , where C i is the class label set of T i . Apart from the domain differences, we consider the situation where the class label sets may also be different. Formally, among the K tasks, there exists at least one task pair (T i , T j ) such that the associated label sets C i ≠ C j . 3 Re-consider the example in Figure 1. The goal is to classify reviews into a 5-point rating scale and a positive/negative rating scale, respectively, in two distant domains (i.e., movies and e-commerce products). It is crucial for the model to learn what to transfer and how to transfer across these tasks, in order to improve the performance of both models.
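Concretely, the setting amounts to K labeled datasets whose class label sets need not coincide. A minimal sketch (the dataset names and label sets are taken from our experiments; the data structure and function name are illustrative):

```python
# Distant transfer setting: K tasks T_1..T_K with training sets D_1..D_K,
# where at least one pair of label sets C_i and C_j differs.
tasks = {
    "SST-5":  ["1", "2", "3", "4", "5"],   # C_1: 5-point rating scale
    "Amazon": ["positive", "negative"],    # C_2: binary polarity
}

def has_distant_labels(tasks):
    """True if at least one pair of tasks has different class label sets."""
    label_sets = [set(labels) for labels in tasks.values()]
    return any(a != b for a in label_sets for b in label_sets)

print(has_distant_labels(tasks))  # → True
```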

Solution Overview
An overview of Meta-DTL is shown in Figure 2. It consists of three modules: i) Task Representation Learning (TRL), ii) Multi-task Meta-learner Training (MMT), and iii) Task-specific Model Fine-tuning (TMF). Specifically, for each task T i , TRL employs a pre-trained task encoder to make a one-pass scan over the training set D i . It represents each task T i as a collection of prototypical vectors, denoted as P i = { p i,j }, where p i,j is the j-th prototypical embedding vector of T i , corresponding to the j-th class in D i . Here, P i gives us a panoramic picture of the task T i in the embedding space. As we aim to address distant transfer learning, simple multi-task training inevitably suffers from negative transfer. In MMT, we obtain a meta-learner M that digests only transferable knowledge across all the K tasks. We first construct a prototype-based Meta Knowledge Graph (Meta-KG, denoted as G) from P 1 , · · · , P K , implicitly describing the relations among tasks and classes. For each training instance x i,j of all tasks, we query x i,j in G to generate the meta-knowledge score m i,j , which represents the degree of knowledge transferability of the input x i,j . 4 Additionally, the Weighted Maximum Entropy Regularizers (WMERs) are proposed and integrated into the model to make the meta-learner M more task-agnostic and unbiased. Finally, in TMF, we fine-tune the meta-learner M to generate the K classifiers for the K tasks, based on their own training sets D 1 , · · · , D K .

Task Representation Learning
The first step of TRL is to learn the implicit relations among classes across the K tasks. In meta-learning, prototypes are frequently employed to characterize the class information by concrete representations (Snell et al., 2017). We notice that in NLP tasks, many class labels have rich meanings that are useful for modeling the class semantics. For example, the label positive in review analysis can be directly associated with positive terms in reviews (e.g., "price-worthy", "good value", "enjoyable"). Let D i,j be the subset of D i with instances assigned to the j-th class label. The j-th prototypical vector p i,j of task T i is defined as the mean of the label-aware embeddings of its instances:

p i,j = (1 / |D i,j |) Σ (x i,j , c i,j ) ∈ D i,j E(x i,j , c i,j ),

where E(·, ·) is an embedding function that encodes both the textual input x i,j and its class label c i,j by PLMs. By combining all such vectors, we obtain the representation of the task T i as: P i = { p i,1 , · · · , p i,|C i | }.

The self-attention mechanism (Vaswani et al., 2017) frequently used in PLMs such as BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) makes it quite straightforward to implement the function E(·, ·). For single-text classification, the input to the pre-trained encoder is formatted as the pair "[CLS] x i,j [SEP] c i,j [SEP]", where x i,j and c i,j represent the input text and its class label. During the forward pass of the hidden layers, the text inputs and the class label can attend to each other. Hence, the label information is fused into the input text representations. Finally, we take the average pooled output of the last encoder layer as E(x i,j , c i,j ). 6

Algorithm 1: Meta-learner Training Algorithm
1: Construct the Meta-KG G(V, L);
2: for each training instance x i,j do
3:   Compute α i,j , β i,j and m i,j based on G;
4: end for
5: Restore the underlying PLM's parameters from the pre-trained model, with others randomly initialized;
6: while the number of training steps does not reach a limit do
7:   Sample a mini-batch B of training instances;
8:   Update all parameters by minimizing the loss Σ x i,j ∈ B L(x i,j );
9: end while
10: return the meta-learner M (i.e., the collection of updated parameters of the PLM).
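The prototype computation can be sketched as follows (a minimal numpy illustration in the spirit of Snell et al. (2017), assuming each prototype is the mean of the label-aware embeddings E(x, c) of one class; the function name is ours, and the actual embeddings come from the PLM encoder):

```python
import numpy as np

def class_prototypes(embeddings, labels, num_classes):
    """Compute P_i = {p_{i,j}}: the prototype of class j is the mean of the
    label-aware embeddings E(x, c) over the instances of that class.
    `embeddings` is an (n, d) array; `labels` holds integer class ids."""
    protos = np.zeros((num_classes, embeddings.shape[1]))
    for j in range(num_classes):
        protos[j] = embeddings[labels == j].mean(axis=0)
    return protos
```

For instance, with two classes and three instances, the prototype of each class is simply the centroid of that class's embedding vectors.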

Multi-task Meta-learner Training
The training algorithm of the meta-learner is summarized in Algorithm 1.

Obtaining the Meta-knowledge
After TRL, we represent all the acquired knowledge P 1 , · · · , P K as the Meta-KG G(V, L). In G, each prototypical vector p i,j is treated as a node in V. Each edge l i,j,m,n ∈ L denotes the similarity between two prototypical vectors p i,j and p m,n . For simplicity, we assume the edge weight w(l i,j,m,n ) = cos( p i,j , p m,n ), with cos(·, ·) being the cosine similarity function. Hence, G is a highly condensed representation of all the K tasks that takes the class semantics into account.
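The Meta-KG construction can be sketched as follows (an illustrative numpy snippet; storing the graph as a dictionary of pairwise edge weights keyed by (task, class) is our own choice of data structure):

```python
import numpy as np

def cosine(u, v):
    """cos(u, v): cosine similarity between two prototype vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def build_meta_kg(prototypes):
    """Nodes are prototype vectors p_{i,j}, keyed by (task, class); each edge
    l_{i,j,m,n} carries the weight w = cos(p_{i,j}, p_{m,n})."""
    keys = list(prototypes)
    return {
        (a, b): cosine(prototypes[a], prototypes[b])
        for a in keys for b in keys if a != b
    }
```

Because the edge weights are computed once from the prototypes, later queries against G are cheap lookups.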
During MMT, for each input x i,j , we query x i,j in G to generate the meta-knowledge m i,j . Here, we treat the meta-knowledge as a scalar that represents the degree of transferability. Firstly, we define the instance-level meta-knowledge α i,j based on the similarity between E(x i,j , c i,j ) and the prototypes in P̄ i , where P̄ i is the collection of all prototypical vectors not associated with the task T i . Hence, if (x i,j , c i,j ) is similar to the instances in any class of the other K − 1 tasks, it should be more transferable and thus more useful when we train the meta-learner.
However, using the weight α i,j alone is not sufficiently robust when the input instance is an abnormal sample (i.e., an outlier). We therefore also consider the class-level meta-knowledge β i,j , computed in the same way except that E(x i,j , c i,j ) is replaced by its class prototypical vector p i,j . This computation is highly efficient, as all such weights have been pre-computed and stored in G. Finally, the meta-knowledge m i,j is computed as: m i,j = (α i,j + β i,j ) / 2.
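The exact aggregation over P̄ i is elided in this copy of the paper; the following is a sketch under the assumption that each score takes the maximum cosine similarity to the out-of-task prototypes, which matches the "similar to any class" reading above (function and argument names are ours):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def meta_knowledge(instance_emb, class_proto, other_protos):
    """instance_emb is E(x_{i,j}, c_{i,j}); class_proto is p_{i,j};
    other_protos is the set of prototypes of the K-1 other tasks.
    Assumes max-similarity aggregation (a modeling assumption on our part)."""
    alpha = max(cosine(instance_emb, p) for p in other_protos)  # instance-level
    beta = max(cosine(class_proto, p) for p in other_protos)    # class-level
    return (alpha + beta) / 2.0                                 # m_{i,j}
```

Note that beta depends only on prototype pairs, so it can be read directly from the pre-computed edge weights of G.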

Training the Meta-learner
As discussed earlier, the properties of a good meta-learner should be twofold: i) capturing transferable knowledge, and ii) being task-agnostic and unbiased.
The architecture of the meta-learner in Meta-DTL is similar to MT-DNN (Liu et al., 2019): each task has its own task-specific output layer, plus a shared PLM-based encoder. The parameters of the underlying PLM encoder are initialized by its pre-training results. A meta controller is designed to control the training process. It ensures that each data instance x i,j is selected with the probability p(x i,j ) = 1 / (K |D i |). Hence, each task T i is selected with the probability p(T i ) = Σ x i,j ∈ D i p(x i,j ) = 1/K. By employing this uniform distribution over tasks, we ensure that each task has an equal opportunity to be learned. During training, the first loss is the weighted cross-entropy loss L CE (x i,j ): 7

L CE (x i,j ) = − m i,j Σ c ∈ C i 1 (c i,j = c) log τ c (x i,j ),

where 1 (·) is the indicator function that returns 1 if the input condition is true and 0 otherwise, and τ c (x i,j ) is the predicted probability of x i,j associated with the class c ∈ C i . L CE (x i,j ) ensures that each sample x i,j is weighted by m i,j . Hence, transferable instances gain larger weights during training.
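The meta controller's sampling scheme and the meta-knowledge-weighted cross-entropy can be sketched as follows (a numpy illustration; function names are ours):

```python
import random
import numpy as np

def sample_instance(datasets):
    """Meta-controller sampling: pick a task uniformly (p(T_i) = 1/K), then an
    instance uniformly within it, giving p(x_{i,j}) = 1 / (K * |D_i|)."""
    task = random.choice(list(datasets))
    return task, random.choice(datasets[task])

def weighted_ce(probs, gold, m):
    """L_CE(x_{i,j}) = -m_{i,j} * log tau_{gold}(x_{i,j}): transferable
    instances (large m) contribute more to the training signal."""
    return -m * float(np.log(probs[gold]))
```

Note that uniform task-level sampling deliberately over-samples small datasets relative to their size, so every task is learned equally often.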
However, minimizing L CE (x i,j ) may result in a biased meta-learner. Consider a simple example where T 1 and T 2 are highly similar to each other, while T 3 is more dissimilar. Based on the previous procedure, training instances of T 1 and T 2 have larger weights in general. Hence, the meta-learner is biased towards T 1 and T 2 , which gives poor initialization values when learning the final model for T 3 . To make the model more task-agnostic, inspired by Jamal and Qi (2019), we integrate the meta-knowledge into the maximum entropy regularization and propose the Weighted Maximum Entropy Regularizers (WMERs) as an auxiliary loss, denoted as L ME (x i,j ):

L ME (x i,j ) = − m i,j Σ c ∈ C i (1 / |C i |) log τ c (x i,j ),

where the predicted probability of each sample x i,j is compared against the |C i |-dimensional uniform distribution (1/|C i |, · · · , 1/|C i |) by cross-entropy, weighted by m i,j . L ME (x i,j ) penalizes the meta-learner for fitting too closely to specific tasks, avoiding the generation of biased models. Finally, the total sample-wise loss L(x i,j ) is derived as:

L(x i,j ) = L CE (x i,j ) + λ L ME (x i,j ),

where λ ∈ (0, 1) is a pre-defined balancing factor between the two losses.
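Combined, each sample's loss reduces to a cross-entropy with per-class weights m i,j (1 (c i,j = c) + λ/|C i |); a minimal numpy sketch (function name is ours):

```python
import numpy as np

def total_loss(probs, gold, m, lam):
    """L(x) = L_CE(x) + lambda * L_ME(x)
            = -sum_c m * (1[c == gold] + lam / |C|) * log tau_c(x),
    i.e. the weighted gold-label term plus a weighted cross-entropy against
    the uniform distribution over the task's |C| classes."""
    C = len(probs)
    onehot = (np.arange(C) == gold).astype(float)
    return float(-(m * (onehot + lam / C) * np.log(probs)).sum())
```

Setting lam = 0 recovers the plain weighted cross-entropy; larger lam pulls the predictions toward the uniform distribution, discouraging task-specific overfitting.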
Discussion. Based on the derivation of L(x i,j ), we can see that each sample x i,j is only associated with a |C i |-length constant weight vector whose elements are m i,j (1 (c i,j = c) + λ/|C i |), which can be pre-computed based on G. Re-consider the process in Algorithm 1. We do not use second-order update steps as in MAML (Finn et al., 2017). This is because i) such a training process would be computationally expensive for large-scale PLMs; and ii) our algorithm does not have any meta-testing steps. Hence, our algorithm is highly efficient for learning across multiple NLP tasks at a large scale.

Task-specific Model Fine-tuning
After obtaining the meta-learner M, in TMF, we fine-tune M to generate K classifiers for the K underlying tasks separately, based on their own task-specific training sets D 1 , · · · , D K . The meta-knowledge and WMERs are removed from the loss function. Hence, the total dataset-level loss function L * (T i ) of the task T i is defined as follows:

L * (T i ) = − Σ x i,j ∈ D i Σ c ∈ C i 1 (c i,j = c) log τ * c (x i,j ),

where τ * c (x i,j ) is the task-specific prediction function of the input x i,j w.r.t. the class c ∈ C i .
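The fine-tuning objective is plain cross-entropy over each task's own training set, with the meta-knowledge weights and WMERs dropped; a minimal sketch (function name is ours):

```python
import numpy as np

def finetune_loss(prob_matrix, golds):
    """Dataset-level loss L*(T_i) for TMF: unweighted cross-entropy over D_i.
    `prob_matrix` is (n, |C_i|) of predicted probabilities; `golds` holds the
    gold class index of each instance."""
    rows = np.arange(len(golds))
    return float(-np.log(prob_matrix[rows, golds]).sum())
```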

Experiments
In this section, we conduct extensive experiments to evaluate the performance of Meta-DTL and compare it against strong baselines.

Datasets and Experimental Settings
We employ both BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) as our PLMs to evaluate Meta-DTL 8 . Three sets of NLP tasks are used for evaluation, with the statistics of all seven public datasets reported in Table 1:
• Review Analysis: It transfers knowledge across three datasets for coarse- and fine-grained review sentiment classification, namely Amazon (Blitzer et al., 2007), IMDb (Maas et al., 2011) and SST-5 (Socher et al., 2013). Note that the domains of SST-5 and IMDb are different from Amazon.
• Natural Language Inference: Two different sentence pair classification tasks related to Natural Language Inference (NLI) are considered. MNLI (Williams et al., 2018) is a large-scale benchmark dataset, with the task of predicting the relation between a sentence pair as "entailment", "neutral" or "contradiction". SciTail (Khot et al., 2018) is a scientific question answering task, with only two labels: "entailment" and "neutral".
• Lexical Semantics: We further consider two term pair classification tasks extensively studied in lexical semantics. Shwartz (Shwartz et al., 2016) is a popular dataset for hypernymy detection, which aims at classifying term pairs into "hypernymy" ("is-a") or "non-hypernymy" based on their semantic meanings. BLESS (Baroni and Lenci, 2011) is a dataset derived from WordNet, which is used to evaluate lexical relation classification models. The BLESS task involves a wider spectrum of lexical relation types, such as hypernymy, co-hyponymy and meronymy.
In each set of the experiments, we transfer knowledge from all the other tasks in the same set to the target one. For example, the model for SST-5 is trained by transferring the knowledge from both Amazon and IMDb, together with its own training set (SST-5). The training/development/testing splits of Amazon and MNLI are the same as in prior work. As IMDb does not contain a separate development set, we randomly sample a proportion of the training set for parameter tuning. We use the lexical split of the Shwartz dataset in the experiments to prevent lexical memorization and make the results more robust. For data splits of other datasets, refer to their original papers. 9

We implement Meta-DTL and all the baselines on two popular PLMs: BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020). All the algorithms are implemented with TensorFlow and trained with an NVIDIA Tesla V100 GPU (32GB). We use Accuracy as the evaluation metric for all the tasks. For better reproducibility, we uniformly set the sequence length as 128 for the first two sets of experiments and 32 for the third, and set the batch size as 32. The learning rate is tuned from {5e−5, 1e−5}. The numbers of epochs for MMT and TMF are tuned from 1∼3 and 3∼5, respectively. λ is set to 0.1 by default. The parameter regularization and the optimizer settings are the same as in Devlin et al. (2019). Since we do not modify the architecture of our final models, models trained by Meta-DTL have the same size and inference speed as BERT (Devlin et al., 2019) or ALBERT (Lan et al., 2020). In the following experiments, we reproduce results for all baselines, and report the accuracy scores of both baselines and our method averaged over three random runs (with different seeds). Hence, the impact of random seeds is minimized.

9 Note that the experimental settings in our work may look similar to prior work. However, the task of the previous work is to transfer knowledge among different sub-domains within the same task, such as different sub-domain datasets in Amazon and MNLI, respectively. Our experimental settings are significantly more challenging, as we aim to transfer knowledge across datasets with distant domains, language styles and class labels. The learning gaps between tasks in our work are much larger.

Table 1: Statistics of the datasets.

Name   | Task Description                       | Classification Label Set | #Train | #Dev. | #Test
SST-5  | Fine-grained movie review analysis     | {1, 2, 3, 4, 5}          | 8,544  | 1,101 | 2,210
Amazon | Coarse-grained product review analysis | {positive, negative}     | 7,000  | 500   | 500
IMDb   | Coarse-grained movie review analysis   | {positive, negative}     | 23,785 | 1,215 | 25,000
MNLI   | NLI across multiple genres             | …                        | …      | …     | …

General Performance Comparison
In this section, we compare Meta-DTL against previous approaches. The following four methods are considered as strong baselines:
• Single-task: Fine-tuning BERT (Devlin et al., 2019) or ALBERT (Lan et al., 2020) on the single-task training set only.
• Multi-task: Fine-tuning the PLM on all the tasks by multi-task learning. Each task has its own prediction heads, with the architecture similar to MT-DNN .
• Task Combination: Combining all the training sets and treating them as one task. The label set of this method is ∪ i=1,··· ,K C i .
• Meta-FT * : To our knowledge, Meta-FT achieves the highest performance on cross-domain transfer learning for PLMs. However, it cannot handle tasks with different class label sets. We implement a variant named Meta-FT * , which has a separate prediction head for each task.
The results of Meta-DTL and the baselines on all seven testing sets are shown in Table 2. Generally speaking, the performance gains of Meta-DTL over all three sets of tasks and seven datasets are consistent. With the integration of Meta-DTL, the accuracy of fine-tuned BERT increases by 2.2% for review analysis, 1.2% for NLI and 1.1% for the two lexical semantics tasks. A similar conclusion holds for ALBERT. This shows that even when the tasks differ in domains and class label sets, Meta-DTL enables PLMs to learn from these distant tasks effectively. We also observe that simple multi-task training does not bring significant improvements; trivial learners may easily suffer from negative transfer. For example, the performance on SST-5 drops on both BERT and ALBERT when the model is jointly trained with Amazon and IMDb, since the learning objective of SST-5 is different from those of Amazon and IMDb. Meta-FT * is the most competitive approach, but is inferior to Meta-DTL due to the lack of modeling of the learning process across distant tasks and of task-agnostic designs. In summary, Meta-DTL's distant transfer learning ability is clearly confirmed.

Table 4: Meta-DTL performance with different meta-knowledge on seven testing sets in terms of accuracy.

Detailed Model Analysis
In this section, we analyze the algorithmic performance of Meta-DTL in various aspects.

Ablation Study. In Meta-DTL, we employ two important techniques to capture transferable knowledge, namely injecting meta-knowledge and applying WMERs. In the ablation study, we disable one technique from our full model each time. We report the results on the testing sets of the seven tasks, with BERT as the underlying PLM, in Table 3. The results show that injecting meta-knowledge is slightly more effective than applying WMERs in five out of seven tasks. However, there is no large difference between the two techniques. Therefore, both techniques prove important for acquiring a good meta-learner.

Analysis of Meta-knowledge. We further analyze how different parts of the meta-knowledge contribute to the overall performance, with results shown in Table 4. Columns entitled α i,j and β i,j refer to the adoption of only one type of the scores; the column m i,j (w/o. class label) refers to computing the meta-knowledge without injecting class label information during task representation learning. From the results, we can see that both α i,j and β i,j are effective for learning the meta-knowledge. Comparing m i,j (w/o. class label) to the full implementation, we can see that injecting class label information is also useful.

Parameter Analysis. During MMT, the training data in each batch is sampled from different tasks based on p(T i ). The setting p(T i ) = 1/K leads to the under-sampling of large datasets and the over-sampling of small ones. Hence, the number of epochs cannot be computed in the usual way. In this work, we say MMT finishes one epoch when it runs Σ i=1,··· ,K |D i | / |B| training steps. We vary the number of epochs in MMT, keep other parameters as default and report the performance of downstream tasks over the development sets. The results on the NLI and lexical semantics tasks are shown in Figure 3. The experiments show that too many epochs may hurt the overall performance, forcing the PLM to memorize too much information from non-target tasks.
We suggest that one or two epochs of MMT are sufficient for most cases. We also fix the number of epochs in MMT at 2 and tune the hyper-parameter λ from 0 to 0.5, with the results illustrated in Figure 4. We can see that a suitable choice for λ ranges from 0.1 to 0.2. A larger value of λ can inject too much class label noise into the training process, harming the performance. We also tune the hyper-parameters during TMF (i.e., the learning rate and the number of epochs). We find that when we apply Meta-DTL, we generally do not need to change the hyper-parameter settings compared to the original fine-tuning approach (Devlin et al., 2019). Due to space limitations, we do not elaborate further.

Learning with Small Data
One advantage of transfer learning across tasks is that it reduces the amount of labeled training data required for target tasks. As MNLI is the largest dataset among all, we randomly sample only 1%, 2%, 5%, 10% and 20% of the original training data to train the model. The full SciTail training set is used for knowledge transfer. We list the results on the MNLI testing set with and without Meta-DTL training in Table 6. The results produced by Meta-FT * are also compared. As seen, Meta-DTL improves the performance regardless of the percentage of the training set used. It has a larger increase in accuracy on smaller training sets (a 4.0% increase on 1% of the training set vs. a 1.0% increase on 20%).

Table 5: Cases of review texts in Amazon, SST-5 and IMDb with high and low meta-knowledge scores.

Score | Task   | Review Text                                                                 | Label
High  | Amazon | ...it is just one big failure and is to be avoided...                       | negative
High  | IMDb   | I can't believe I waste my time watching this garbage!...                   | negative
High  | SST-5  | The more you think about the movie, the more you will probably like it.     | 4 (weakly positive)
High  | SST-5  | No, I hate it.                                                              | 1 (strongly negative)
Low   | Amazon | Racism is not the problem with this book -- sure...5 Chinese brothers...    | negative
Low   | SST-5  | ...plays like a badly edited, 91-minute trailer (and) the director can't... | 1 (strongly negative)

Case Studies
We further present some cases for a better understanding of what the meta-knowledge captures across tasks. In Table 5, review texts from Amazon, SST-5 and IMDb with high and low m i,j scores are illustrated, together with their class labels. As seen, although there exist some domain and class label differences, our algorithm is able to find review texts that express general polarities and should be transferable across the three tasks. For instance, high-score expressions such as "one big failure" and "garbage" give strong indications of their polarities, regardless of the class label set concerned. In contrast, low-score texts such as "5 Chinese brothers" and "91-minute trailer" describe specific details about certain subjects and are not very useful for knowledge transfer. Hence, the learned meta-knowledge is truly insightful.

Conclusion and Future Work
In this paper, we propose the Meta-DTL framework for PLMs, to capture knowledge from tasks with distant domains and class labels. Extensive experiments confirm the effectiveness of Meta-DTL from various aspects. Future work includes: i) applying Meta-DTL to other PLMs and NLP tasks, and ii) exploring how it can benefit other NLP models.