Connectivity Patterns are Task Embeddings

Task embeddings are task-specific vectors designed to construct a semantic space of tasks, which can be used to predict the most transferable source task for a given target task via the similarity between task embeddings. However, existing methods use optimized parameters and representations as task embeddings, resulting in substantial computational complexity and storage requirements. In this work, we draw inspiration from the operating mechanisms of deep neural networks (DNNs) and biological brains, where neuronal activations are sparse and task-specific, and we use the connectivity patterns of neurons as a unique identifier associated with each task. The proposed method learns to assign importance masks to sub-structures of DNNs, and accordingly indicates the task-specific connectivity patterns. In addition to the storage advantages brought by the binary masking mechanism and structured sparsity, the early-bird nature of the sparse optimization process delivers a computational advantage. Experiments show that our method consistently outperforms other baselines in predicting inter-task transferability across data regimes and transfer settings, while remaining highly efficient in computation and storage.


Introduction
With the rapid development and excellent performance of large pre-trained language models (PLMs), the most prevalent paradigm in natural language processing (NLP) has become pretraining then fine-tuning (Peters et al., 2018; Devlin et al., 2019a; Brown et al., 2020; Lewis et al., 2020; Raffel et al., 2020). Extending upon the two-step training procedure, previous works show that intermediate-task transfer, i.e., fine-tuning the model on an intermediate source task before the target task, can yield further gains (Phang et al., 2018; Wang et al., 2019a). Nevertheless, the improvement from intermediate-task transfer heavily relies on the selection of a proper intermediate task, because some source tasks lead to performance degradation (Yogatama et al., 2019; Pruksachatkun et al., 2020). One straightforward approach is to enumerate every possible (source, target) task combination, but it is extremely expensive. Therefore, recent works explore methods to predict inter-task transferability accurately and with high efficiency.
The current state-of-the-art (SOTA) works are established on task embeddings, i.e., they leverage a single vector to represent a task. They predict inter-task transferability by computing the similarity between task embeddings. Task2Vec (Achille et al., 2019; Vu et al., 2020) develops task embeddings based on the Fisher information matrix, but requires fine-tuning the full model and consumes a large amount of storage (Zhou et al., 2022). Recently, researchers have proposed that efficiently tuned parameters such as prompts (Li and Liang, 2021; Liu et al., 2021) and LoRA (Hu et al., 2022) encode rich information about a task and can thus serve as task embeddings (Poth et al., 2021; Vu et al., 2022; Zhou et al., 2022). However, these tuned parameters are sensitive to model initialization and stochasticity (Li and Liang, 2021; Lester et al., 2021), and optimizing them consumes significantly more computational resources than traditional fine-tuning (Ding et al., 2022).
Different from these approaches, we draw inspiration from the shared working mechanisms of DNNs and biological brains to develop high-quality task embeddings. We start by considering which parts of the knowledge within the model are being utilized for a given task. Recent works in sparse optimization and model pruning have shown that sub-structures (e.g., neurons, attention heads, channels, and layers) from different parts of the model exhibit specialization in distinct knowledge and possess varying degrees of importance for a particular task (Dalvi et al., 2020; Liu et al., 2017; Voita et al., 2019a; Glorot et al., 2011; Georgiadis, 2019; Li et al., 2022). This is consistent with findings in neuroscience that the activities of neurons and the connectivities in biological brains are sparse (Kerr et al., 2005; Poo and Isaacson, 2009; Barth and Poulet, 2012) and task-specific (Duncan, 2010; Fox et al., 2005; Crinion et al., 2003; Newton et al., 2007). These remarkable findings motivate us to use task-specific connectivity patterns in DNNs to represent tasks.
In this work, we propose a novel task embedding, namely Connectivity Patterns as Task Embedding (COPATE), and apply it to predict inter-task transferability, as illustrated in Figure 1. Our key insight is that in over-parameterized DNNs, there exist connectivity patterns (i.e., the structures of subnetworks) that are functional for a certain task and can capture high-density task-specific information. Concretely, we assign importance masks to the attention heads and intermediate neurons of PLMs, jointly train the masks and the model, and extract task embeddings according to the learned masks. Our method has two strengths in efficiency: 1) it is computation-friendly, as we extract connectivity patterns early in training; 2) it is storage-friendly, because our embedding granularity is coarse-grained and COPATE can be represented by a binary mask. Experiments show that compared to other approaches, COPATE has superior inter-task prediction capability across data regimes and transfer settings. Our code is available on GitHub.
Our contributions can be summarized as follows:
• Inspired by the working mechanisms of DNNs and biological brains, we propose COPATE, a novel task embedding that represents tasks with sparse connectivity patterns.
• We propose a method to obtain COPATE with sparse optimization techniques, and show a significant positive correlation between embedding similarity and task transferability.
• We conduct thorough experiments on 342 transfer combinations under different settings to show the effectiveness of our method. We further explore an intermediate-curriculum transfer setting to investigate whether there is a beneficial curriculum for a target task.
Identifying Sparse, Task-specific Connectivity Patterns

In this section, we present the framework for identifying task-specific connectivity patterns. We represent the task-specific connectivity patterns via the structure of essential subnetworks found by sparse optimization and pruning techniques (Liu et al., 2017; Chen et al., 2021a; Zheng et al., 2022), which consist of a searching stage (Sec. 2.1) and an extracting stage (Sec. 2.2).

Finding Connectivity Patterns
Typically, BERT is constructed from multiple transformer encoder layers with a uniform structure (Vaswani et al., 2017). Each layer has a multi-head self-attention (MHA) block, a feed-forward network (FFN), and residual connections around each block. The MHA is formulated as:

$$\mathrm{MHA}(x) = \sum_{i=1}^{N_h} \mathrm{Att}_{W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}, W_O^{(i)}}(x),$$

where $x$ is the input, $N_h$ is the number of heads, and the projections $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times d_h}$ and $W_O^{(i)} \in \mathbb{R}^{d_h \times d}$ are the query, key, value, and output matrices of the $i$-th attention head. Here $d$ is the hidden size (e.g., 768), and $d_h = d / N_h$ denotes the output dimension of each head (e.g., 64).
An FFN is parameterized by $W_1 \in \mathbb{R}^{d \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d}$:

$$\mathrm{FFN}(x) = \mathrm{gelu}(x W_1) \, W_2,$$

where $d_f = 4d$.

Figure 2: Correlation between COPATE similarity and inter-task transferability. Each point represents transfer from a source task to a target task. The x-axis is the similarity between the associated source and target embeddings, averaged over three runs, and the y-axis measures the relative transfer gain on the target. We include the Pearson correlation coefficient (r) and p-value. The plots illustrate a significant positive correlation between COPATE similarity and inter-task transferability. See Appendix B for results on more datasets.

Learnable Importance Masks
We adopt a coarse-grained structured pruning strategy to shape connectivity patterns. Specifically, we use modified network slimming (Liu et al., 2017; Chen et al., 2021a) to find which heads and intermediate neurons are essential for a given task. We first assign learnable importance masks to each head and each intermediate neuron:

$$\mathrm{MHA}(x) = \sum_{i=1}^{N_h} m_H^{(i)} \cdot \mathrm{Att}_{W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}, W_O^{(i)}}(x), \qquad \mathrm{FFN}(x) = \left( m_F \odot \mathrm{gelu}(x W_1) \right) W_2,$$

where $m_H$ denotes the masks for heads, $i$ is the index of the head, and $m_F$ denotes the masks for FFN neurons. Then, we jointly train BERT with the importance masks under a sparsity-inducing regularizer:

$$R(m) = \lambda_H \| m_H \|_1 + \lambda_F \| m_F \|_1,$$

where $m = \{m_H, m_F\}$, and $\lambda_H$ and $\lambda_F$ denote the regularization strengths for the two kinds of masks, respectively. Hence, the final optimization objective is:

$$\min_{\theta, m} \; \mathcal{L}(\theta, m) + R(m),$$

where $\mathcal{L}$ is the original loss function of fine-tuning.
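To make the searching stage concrete, the following is a minimal PyTorch sketch of the joint objective: importance masks trained together with the model under an L1 sparsity regularizer. The toy backbone, dimensions, learning rate, and the placeholder coupling between the masks and the loss are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

N_LAYERS, N_HEADS, N_INTER = 2, 4, 32     # toy sizes (BERT-base uses 12, 12, 3072)
LAMBDA_H, LAMBDA_F = 1e-4, 1e-4           # regularization strengths lambda_H, lambda_F

# Learnable importance masks: one scalar per head and per intermediate neuron.
m_H = nn.Parameter(torch.ones(N_LAYERS, N_HEADS))
m_F = nn.Parameter(torch.ones(N_LAYERS, N_INTER))

model = nn.Linear(16, 2)                  # stand-in for the PLM backbone
optimizer = torch.optim.Adam(list(model.parameters()) + [m_H, m_F], lr=1e-3)

def sparsity_regularizer(m_H, m_F):
    # R(m) = lambda_H * ||m_H||_1 + lambda_F * ||m_F||_1
    return LAMBDA_H * m_H.abs().sum() + LAMBDA_F * m_F.abs().sum()

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
for step in range(3):
    # In the real model, m_H rescales each head's output and m_F each
    # intermediate activation; here the coupling to the loss is a placeholder.
    logits = model(x) * m_H.mean()
    loss = nn.functional.cross_entropy(logits, y) + sparsity_regularizer(m_H, m_F)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```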

Extracting Connectivity Patterns
Early-stopping Strategy Note that the joint training is still as expensive as traditional fine-tuning. Fortunately, You et al. (2020) and Chen et al. (2021b) point out that the importance masks converge early in the searching stage. This inspires us to stop the joint training early and dig out early-bird connectivity patterns to generate task embeddings. Nevertheless, it is difficult to determine the exact search termination time, as the termination moments of different tasks differ. Moreover, the masks of MHA and FFN typically have different convergence rates. Hence, we adopt a termination metric following Xi et al. (2022), which terminates the searching process when the normalized mask distances between several consecutive mini-epochs are all smaller than a threshold γ.
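A sketch of this early-stopping criterion is given below, assuming masks are binarized at the target sparsity after each mini-epoch and compared via a normalized (Hamming) mask distance; the window size, threshold value, and helper names are illustrative assumptions rather than the exact recipe of the cited works.

```python
import torch

GAMMA = 0.05          # early-stopping threshold (see the ablation on gamma)
WINDOW = 5            # number of consecutive mini-epoch distances that must agree

def binarize(mask: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the (1 - sparsity) fraction of entries with the largest magnitude."""
    k = max(1, int(mask.numel() * (1.0 - sparsity)))
    threshold = mask.abs().flatten().kthvalue(mask.numel() - k + 1).values
    return (mask.abs() >= threshold).float()

def mask_distance(prev: torch.Tensor, curr: torch.Tensor) -> float:
    """Normalized Hamming distance between two binary masks."""
    return (prev != curr).float().mean().item()

def should_stop(history: list, sparsity: float) -> bool:
    """history holds the raw mask snapshot taken after each mini-epoch."""
    if len(history) < WINDOW + 1:
        return False
    binary = [binarize(m, sparsity) for m in history[-(WINDOW + 1):]]
    distances = [mask_distance(a, b) for a, b in zip(binary, binary[1:])]
    return all(d < GAMMA for d in distances)
```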
Pruning Strategy After the joint training, we can perform pruning on the original model to extract the important connectivity patterns that encode task-specific information. Specifically, the self-attention heads and intermediate neurons with the smallest importance masks are believed to contribute the least to the task, and their masks are set to 0, while the masks of the surviving elements are set to 1. We can therefore generate storage-efficient task embeddings from the resulting model structure.

COPATE: Connectivity Patterns as Task Embedding
In this section, we first show how we generate task embeddings from the task-specific connectivity patterns at hand (Sec. 3.1). Next, we provide empirical evidence for the appropriateness of using the obtained task embeddings to predict inter-task transferability (Sec. 3.2).

Task Embedding Generating
Typically, the structure of a neural network can be represented as a mask vector

$$m = (m_1, m_2, \ldots, m_N) \in \{0, 1\}^N,$$

where $N$ denotes the number of elements (i.e., sub-structures) that construct the network, and the value of $m_i$ indicates whether the $i$-th element is pruned or not. In our framework, the elements are self-attention heads and intermediate neurons, so the structured subnetworks are represented by

$$m_H \in \{0, 1\}^{N_L \times N_h}, \quad m_F \in \{0, 1\}^{N_L \times N_f},$$

where $N_L$ denotes the number of transformer layers, $N_h$ denotes the number of heads in each layer, and $N_f$ is the number of intermediate neurons in each layer. Hence, the resulting task embedding is

$$e = [\mathrm{vec}(m_H); \mathrm{vec}(m_F)].$$

We summarize the procedure of generating COPATE in Algorithm 1. COPATE is quite storage-efficient owing to its binary form. For example, the embedding of BERT-BASE consumes only 4626 bytes to store.
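The following NumPy sketch illustrates how binary COPATE vectors could be produced from learned masks, and why the storage cost is so small; the random mask values are placeholders, and the layerwise/global pruning split for heads/FFN follows the strategies discussed in Appendix I.1.

```python
import numpy as np

N_LAYERS, N_HEADS, N_INTER = 12, 12, 3072    # BERT-base layout
KEEP_H, KEEP_F = 1.0 / 3.0, 0.4              # kept fractions (see the sparsity ablation)

rng = np.random.default_rng(0)
m_H = rng.random((N_LAYERS, N_HEADS))        # learned head importances (placeholder)
m_F = rng.random((N_LAYERS, N_INTER))        # learned neuron importances (placeholder)

# Heads: layerwise pruning -- keep the top KEEP_H fraction within each layer.
k_h = int(N_HEADS * KEEP_H)
head_bits = np.zeros_like(m_H, dtype=np.uint8)
for layer in range(N_LAYERS):
    head_bits[layer, np.argsort(m_H[layer])[-k_h:]] = 1

# FFN neurons: global pruning -- keep the top KEEP_F fraction across all layers.
k_f = int(m_F.size * KEEP_F)
ffn_bits = np.zeros(m_F.size, dtype=np.uint8)
ffn_bits[np.argsort(m_F.ravel())[-k_f:]] = 1

# COPATE is the concatenation of the two flattened binary masks.
embedding = np.concatenate([head_bits.ravel(), ffn_bits])
print(embedding.size)                        # 144 + 36864 = 37008 bits
print(np.packbits(embedding).nbytes)         # 4626 bytes when bit-packed
```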

Positive Correlation between COPATE Similarity and Task Transferability
We first calculate the similarity between the COPATEs of different tasks with the Hamming similarity, which is defined via the number of positions at which the corresponding symbols are the same:

$$\mathrm{sim}(e^s, e^t) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ e_i^s = e_i^t \right],$$

where $e^s$ and $e^t$ are the embeddings of the source and target tasks. Since the numbers of self-attention heads and intermediate neurons differ significantly, we calculate the similarity of the two types of elements separately, and each contributes equally to the final similarity.
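A small sketch of this similarity computation is shown below, assuming the BERT-BASE layout of 144 head positions followed by 36,864 intermediate-neuron positions; the random embeddings are placeholders.

```python
import numpy as np

N_HEAD_BITS = 12 * 12          # BERT-base: 144 head positions
N_FFN_BITS = 12 * 3072         # BERT-base: 36864 intermediate-neuron positions

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of positions at which the two binary vectors agree."""
    return float((a == b).mean())

def copate_similarity(e_src: np.ndarray, e_tgt: np.ndarray) -> float:
    # Head bits and FFN bits are scored separately and contribute equally.
    sim_head = hamming_similarity(e_src[:N_HEAD_BITS], e_tgt[:N_HEAD_BITS])
    sim_ffn = hamming_similarity(e_src[N_HEAD_BITS:], e_tgt[N_HEAD_BITS:])
    return 0.5 * (sim_head + sim_ffn)

# Toy usage with random binary embeddings.
rng = np.random.default_rng(0)
e_s = rng.integers(0, 2, N_HEAD_BITS + N_FFN_BITS)
e_t = rng.integers(0, 2, N_HEAD_BITS + N_FFN_BITS)
print(copate_similarity(e_s, e_t))
```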
We then explore whether the similarity between COPATEs is correlated with task transferability. We calculate the relative transfer gain to measure the impact of transfer learning. Specifically, given a source task $s$ and a target task $t$, if a baseline PLM that is directly fine-tuned on the target dataset (without any intermediate transferring) achieves a performance of $T(t)$, while a transferred model achieves a performance of $T(s, t)$, the relative transfer gain can be expressed as:

$$G(s, t) = \frac{T(s, t) - T(t)}{T(t)}.$$
Figure 2 shows how the relative transfer gain changes as a function of the similarity between the source and target task embeddings. Overall, there is a significant positive correlation between the similarity of task embeddings and task transferability on the majority of the target tasks (16 out of 19). In many cases the correlation coefficient attains a high magnitude; for example, on the DROP task it reaches 0.78 (p = 0.00013).
These encouraging results suggest that COPATE is promising for accurately predicting inter-task transferability. Concretely, for a novel target task, we rank the candidate source tasks in descending order of COPATE similarity and select the top-ranked task for intermediate fine-tuning.
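For illustration, a minimal sketch of this selection step follows; the task names and embeddings are placeholders, and a plain (unsplit) Hamming similarity is used for brevity instead of the head/FFN-weighted version above.

```python
import numpy as np

def rank_sources(target_emb: np.ndarray, source_embs: dict) -> list:
    """Return (task, similarity) pairs sorted from most to least similar."""
    scored = [(name, float((emb == target_emb).mean()))
              for name, emb in source_embs.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)

rng = np.random.default_rng(0)
dim = 37008
target = rng.integers(0, 2, dim)
candidates = {name: rng.integers(0, 2, dim) for name in ["MNLI", "SQuAD-1", "SST-2"]}
ranking = rank_sources(target, candidates)
best_source = ranking[0][0]    # used as the intermediate task for the target
```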

Predicting Task Transferability
In this section, we perform thorough experiments to empirically demonstrate the capability of COPATE in predicting inter-task transferability.

Experimental Setup
Datasets We conduct experiments with 8 tasks of text classification or regression (CR) and 11 tasks of question answering (QA), following previous works (Vu et al., 2020; Zhou et al., 2022). We list the datasets in Appendix A.

Evaluation Metrics Following prior work, we evaluate source-task rankings with NDCG and Regret@k, where Regret@k measures the relative performance gap between the best of the top-k ranked source tasks and the optimal source task (see Appendix E for more results on Regret@k). In our experiments, we include k = 1 and k = 3.
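For readers unfamiliar with these metrics, the following is a hedged sketch of NDCG and Regret@k as they are commonly formulated for source-task ranking; exact definitions in prior work may differ slightly, and the performance numbers below are toy placeholders.

```python
import numpy as np

def ndcg(predicted_order, performance):
    # Discounted cumulative gain of the predicted ranking, normalized by the
    # ideal ranking; transfer performance serves as the relevance score.
    gains = np.array([performance[t] for t in predicted_order], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float((gains * discounts).sum())
    idcg = float((np.sort(gains)[::-1] * discounts).sum())
    return dcg / idcg

def regret_at_k(predicted_order, performance, k):
    # Relative gap between the optimal source task and the best source task
    # among the top-k predictions (often reported as a percentage).
    best = max(performance.values())
    best_top_k = max(performance[t] for t in predicted_order[:k])
    return (best - best_top_k) / best

# Toy usage: transfer performance of each candidate source on one target task.
perf = {"MNLI": 83.1, "SQuAD-1": 82.4, "SST-2": 80.9, "CoLA": 79.5}
order = ["MNLI", "SST-2", "SQuAD-1", "CoLA"]   # ranking predicted from embeddings
print(round(ndcg(order, perf), 3), round(regret_at_k(order, perf, k=1), 3))
```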

Implementation Details
We perform transfer experiments with all (source, target) combinations and use BERT-BASE (Devlin et al., 2019b) as the backbone. All intermediate tuning and target tuning take 3 epochs. For the FULL → FULL regime, we use the results from Vu et al. (2020). We implement all baseline methods according to their open-source code and the Transformers library (Wolf et al., 2020). When searching for connectivity patterns in our method, we jointly train the masks and the BERT model for 5 epochs. When extracting early-bird embeddings (i.e., EARLY-EMB), we set the maximum number of searching epochs to 1. We perform 5 restarts for stable results in the LIMITED regimes. See Appendix F for more details.

Experimental Results
Table 1 presents the detailed evaluation results. Overall, the proposed COPATE achieves superior performance across task types, transfer scenarios, and data regimes, revealing that it is a robust and accurate predictor of beneficial transfer.
FULL → FULL In this regime, our method attains impressive performance compared to other baselines. For example, in the in-class transfer setting for classification tasks, COPATE exceeds the most competitive baseline by 1.0 in NDCG and achieves a Regret@3 score of 1.2. It is also observed that additional training steps for identifying task-specific connectivity patterns do not necessarily yield large performance improvements in this regime: the efficient EARLY-EMB performs slightly worse than LTH EP=5, but remains comparable.

FULL → LIMITED In this few-shot regime, our method achieves performance comparable to the SOTA baselines. However, we find that on QA tasks, the performance of COPATE degrades sharply as the number of training steps used during the search stage decreases. Compared to LTH EP=5, EARLY-EMB's NDCG on in-class and all-class transfer decreases by 10.1 and 8.9, respectively. This trend is also observable in the LIMITED → LIMITED regime. It is not surprising, as QA tasks are typically more complex and their connectivity patterns require more training steps to converge well. This suggests a trade-off between performance and efficiency when facing limited examples, and additional training resources should be allocated to the search stage to extract high-quality task embeddings.
LIMITED → LIMITED In this regime, COPATE demonstrates exceptional performance and surpasses other existing baselines by a significant margin. For instance, our method outperforms the strongest baseline by 9.5 in terms of NDCG on in-class transfer of QA tasks, and by 4.6 on all-class transfer of QA tasks.

Ablation Study
In this section, we perform ablation studies to show the contribution of each component of our method.
Head vs. FFN Previous experiments utilize both the masks of attention heads and those of intermediate neurons to compute similarity. Here, the contribution of each component is evaluated individually by using each alone to calculate similarity and then assessing the NDCG. Table 2 shows that both components play essential roles in ranking source tasks. We observe that on CR tasks, heads outperform FFN by a large margin, revealing that heads are more important for such tasks.
Impact of Sparsity Figure 3 illustrates the relationship between the level of sparsity and the performance of the obtained embeddings. The performance is significantly impacted by variations in the pruning ratio of heads or FFN when the target tasks are CR, while such variations have a limited effect when the target tasks are QA, revealing that CR tasks are more sensitive to embedding sparsity. After comprehensive consideration, we believe that 1/3 and 0.4 are reasonable sparsity levels for heads and FFN, respectively. We include more ablation studies of pruning strategies, early-stopping thresholds, and the sparsity-inducing regularizer in Appendix I.

Computation and Storage Consumption
Table 3 lists the computational and storage cost of each method. COPATE demonstrates efficiency in both aspects thanks to its design choices (i.e., early stopping, structured pruning, and the binary form of embeddings); in particular, EARLY-EMB exhibits the fastest generation speed and requires only 4.6K bytes to store. TASKEMB is also computation-efficient, but it requires much more storage than COPATE. TEXTEMB is the only method comparable to our approach in terms of efficiency, yet it falls behind EARLY-EMB by an average of 1.6 in NDCG.
Further Storage-efficiency with Task-specific Layers Previous studies have established that layers in BERT are redundant (Dalvi et al., 2020), and that shallower transformer layers contain more general information while deeper layers contain more task-specific information (Voita et al., 2019a; Kim et al., 2020; Sajjad et al., 2020). These insights shed light on further reducing the storage of COPATE by representing tasks using a select number of layers, or even a single layer. Figure 4 illustrates the evaluated performance. We observe that: (1) Using a select number of layers does not result in a significant decrease in performance, and sometimes delivers better performance.
(2) The top-down strategy outperforms the bottom-up strategy, and consistently exceeds the full model in few-shot settings, showing that deep layers effectively encode task-specific information, in line with previous studies. As a result, if we adopt the last six layers for embedding generation (as sketched below), 50% of the storage can be saved with little decrease in performance. We further explore the potential of generating embeddings from a single layer, again sacrificing little performance, in Appendix J.
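A small sketch of how the embedding could be restricted to the top six layers is given below, assuming the per-layer layout used earlier (all head bits grouped by layer, followed by all FFN bits grouped by layer); the random bits are placeholders.

```python
import numpy as np

N_LAYERS, N_HEADS, N_INTER = 12, 12, 3072

def layer_slice(embedding: np.ndarray, layers: list) -> np.ndarray:
    """Keep only the head and FFN bits belonging to the given layers."""
    head_bits = embedding[:N_LAYERS * N_HEADS].reshape(N_LAYERS, N_HEADS)
    ffn_bits = embedding[N_LAYERS * N_HEADS:].reshape(N_LAYERS, N_INTER)
    return np.concatenate([head_bits[layers].ravel(), ffn_bits[layers].ravel()])

rng = np.random.default_rng(0)
full = rng.integers(0, 2, N_LAYERS * (N_HEADS + N_INTER))
top_down_6 = layer_slice(full, list(range(6, 12)))   # layers {6, ..., 11}
print(top_down_6.size / full.size)                   # 0.5 -> 50% storage saved
```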

COPATE Captures Task Relationships
The heatmap in Figure 5 illustrates the hierarchical clustering of the similarities between COPATEs.
The results indicate that the obtained embeddings effectively capture various intuitive task relationships. We observe that tasks with similar characteristics congregate in clusters, such as QA tasks (WikiHop, SQuAD-1, SQuAD-2, DuoRC-s, DuoRC-p, NewsQA, and HotpotQA), similarity and paraphrasing tasks (STS-B and MRPC), NLI tasks (QNLI and MNLI), and single-sentence classification tasks (SST-2 and CoLA). In particular, a closer examination of the clustering reveals that SQuAD-1 and SQuAD-2 are closely grouped together, the latter being an extension of the former (Rajpurkar et al., 2016, 2018). Furthermore, the tight clustering of DuoRC-p and DuoRC-s is also noteworthy, as they are variations of the same movie plots with different lengths (Saha et al., 2018).
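A sketch of the clustering procedure behind such a heatmap is shown below, assuming pairwise similarities are converted to distances and clustered with average linkage via SciPy; the similarity matrix here is a random symmetric placeholder rather than real COPATE similarities.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

tasks = ["SQuAD-1", "SQuAD-2", "MNLI", "QNLI", "SST-2", "CoLA"]
rng = np.random.default_rng(0)
sim = rng.uniform(0.5, 1.0, (len(tasks), len(tasks)))
sim = (sim + sim.T) / 2                  # make the placeholder matrix symmetric
np.fill_diagonal(sim, 1.0)

dist = 1.0 - sim                         # distance = 1 - similarity
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
order = leaves_list(Z)                   # leaf order used to arrange the heatmap
print([tasks[i] for i in order])
```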

Intermediate-curriculum Transfer
Here, we extend the boundary of intermediate-task transfer and examine the potential benefits of a specific intermediate-task curriculum (i.e., a particular order in which to arrange several intermediate tasks) for a target task using COPATE. Three distinct curriculum strategies are considered: (1) the similar-first strategy, which selects the three tasks most similar to the target task and arranges the intermediate tasks in descending order of similarity;
(2) the different-first strategy, which also selects the three tasks most similar to the target task, but arranges them in ascending order of similarity; and (3) the recursive-similar strategy, which starts from the target task, recursively finds the task most similar to the current task three times, stacks these tasks, and then sequentially pops them for intermediate fine-tuning (a code sketch of the three strategies follows the result list below). The results in Table 4 show that: (1) Each curriculum can boost the target task, validating the value of intermediate-task transfer.
(2) The recursive-similar strategy yields the largest performance gain, suggesting that helping each intermediate task be learned better delivers more benefit to the target task.
(3) The different-first strategy performs better than the similar-first strategy, implying that intermediate tasks similar to the target task should be scheduled later.
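A minimal sketch of the three strategies follows, using placeholder task embeddings and a plain Hamming similarity; the helper names and the exact ordering conventions are our reading of the descriptions above, not a definitive implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
tasks = ["MNLI", "QNLI", "SST-2", "CoLA", "SQuAD-1", "target"]
emb = {t: rng.integers(0, 2, 37008) for t in tasks}   # placeholder COPATEs

def sim(a: str, b: str) -> float:
    return float((emb[a] == emb[b]).mean())

def top3(anchor, pool):
    return sorted(pool, key=lambda t: sim(t, anchor), reverse=True)[:3]

def similar_first(target, candidates):
    # Fine-tune on the most similar task first, the least similar of the three last.
    return top3(target, candidates)

def different_first(target, candidates):
    # Same three tasks, but the most similar one is fine-tuned last.
    return list(reversed(top3(target, candidates)))

def recursive_similar(target, candidates):
    # Find the task nearest to the current one three times, push onto a stack,
    # then pop: the chain ends with the task closest to the target.
    stack, current, pool = [], target, set(candidates)
    for _ in range(3):
        nearest = max(pool, key=lambda t: sim(t, current))
        stack.append(nearest); pool.remove(nearest); current = nearest
    return list(reversed(stack))

candidates = [t for t in tasks if t != "target"]
print(similar_first("target", candidates))
print(different_first("target", candidates))
print(recursive_similar("target", candidates))
```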

Related Work
Predicting Beneficial Intermediate Tasks It has been shown that intermediate-task transfer can deliver performance gains for many target tasks (Phang et al., 2018; Wang et al., 2019a; Talmor and Berant, 2019; Liu et al., 2019), but improper intermediate tasks can result in negative transfer (Yogatama et al., 2019; Pruksachatkun et al., 2020). Hence, researchers try to accurately identify the most beneficial source task based on metadata or extracted representations of tasks (Alonso and Plank, 2017; Vu et al., 2020; Poth et al., 2021). Recent works represent tasks with embeddings that are generated from data representations (Vu et al., 2020), model weight information (Achille et al., 2019; Vu et al., 2020), and efficiently tuned parameters (Poth et al., 2021; Vu et al., 2022; Zhou et al., 2022). Different from them, we start from a model-architecture perspective and use connectivity patterns to represent tasks.

F Implementation Details
For PTUNING, we adopt P-Tuning v2 (Liu et al., 2021), which implements a prompt tuning method by introducing additional attention prefix matrices to each transformer layer. We set the prefix length to 20. For LORA, we set r to 8 and α to 8. For the searching stage of winning tickets, we set the regularization strengths λ_H and λ_F to 1e-4.
G More Results of Head vs. FFN

Table 6 and Table 7 show the results of Head vs. FFN in the FULL → LIMITED and LIMITED → LIMITED regimes, respectively. We can still find that both are important for high-quality task embeddings.

H More Results of Impact of Sparsity
Figure 7 and Figure 8 show the results on the impact of sparsity in the FULL → LIMITED and LIMITED → LIMITED regimes, respectively. We can still find that 1/3 and 0.4 are reasonable sparsity levels for heads and FFN, respectively.

I More Ablation Studies

I.1 Impact of Pruning Strategies
In this section, we investigate the impact of different pruning strategies on the embedding performance. Results in Table 8, Table 9, and Table 10 show that layerwise pruning and global pruning are proper strategies for self-attention heads and FFN, respectively.

I.2 Impact of Different Early-Stopping Thresholds
In this section, we investigate the impact of different values of the early-stopping threshold γ. Results in Figure 9 show that the performance of COPATE converges when γ decreases to near 0.05.

I.3 Importance of Sparsity-inducing Regularizer
In this section, we investigate the importance of the sparsity-inducing regularizer during the connectivity pattern searching stage. Results in Table 11 show that the regularizer is indispensable for obtaining high-quality task embeddings.

Figure 1: An overview of COPATE, including the procedures for searching connectivity patterns, generating task embeddings, and selecting source tasks.

Figure 3: Impact of sparsity on the performance of COPATE. The results are from the FULL → FULL regime. See more results in Appendix H.

Figure 4: The impact of using different transformer layers for embedding generation. The number of used layers is shown on the x-axis; that number is selected either "bottom-up" or "top-down". More precisely, a bottom-up setting selecting 4 layers means we use the transformer layers {0, 1, 2, 3}; a top-down setting selecting 4 layers means we use the transformer layers {8, 9, 10, 11}. The NDCG is an average over different transfer settings.

Figure 7: Impact of sparsity on the performance of COPATE. The results are from the FULL → LIMITED regime.

Figure 8: Impact of sparsity on the performance of COPATE. The results are from the LIMITED → LIMITED regime.

Figure 9: Impact of different values of the early-stopping threshold γ on the performance of COPATE.

Figure 10: The impact of using one single transformer layer for embedding generation. The NDCG is an average over different transfer settings.
Algorithm 1: COPATE Generation
Input: model parameters θ, learnable importance masks m, learning rate η, sparsity for self-attention heads p_H, and sparsity for intermediate neurons p_F.
Procedure SEARCHING CONNECTIVITY PATTERNS
  Initialize θ to pre-trained weights;
  repeat
    Jointly update θ and m with the objective in Sec. 2.1;
  until the early-stopping criterion in Sec. 2.2 is satisfied, or the fine-tuning is done;
Procedure GENERATING COPATE WITH LEARNED MASKS
  Reset m_H and m_F to binary form with p_H and p_F according to mask magnitudes, respectively;
  Return the concatenation of the flattened m_H and m_F as the task embedding.

Table 1: Evaluation results of intermediate task selection methods. In-class means that the candidate source tasks have the same type as the target task, while all-class means the candidate source tasks come from all types of tasks. EP denotes the number of epochs used to search for connectivity patterns. R1 denotes Regret@1 and R3 denotes Regret@3. For NDCG, higher is better; for Regret, lower is better. The best performance in each group is highlighted in bold.
For every (source, target) dataset pair, we perform transfer experiments in three data regimes to simulate real-world situations: FULL → FULL, FULL → LIMITED, and LIMITED → LIMITED. The FULL regime includes all training data, while in the LIMITED settings we limit the amount of training data by randomly selecting 1K training examples.

Table 2: Ablation results when heads or intermediate neurons are removed from similarity computation. The results are from the FULL → FULL regime; others are in Appendix G. In-cls and all-cls are short forms of in-class and all-class, respectively; w/o means "without". Both heads and FFN are important for ranking source tasks.

Table 3: Evaluation of time and storage consumption. We average the results over all datasets. #Time is quantified as a multiple of the duration of traditional fine-tuning for a single epoch. We obtain results on one NVIDIA 3090 GPU for a fair comparison. #Storage is in bytes; each float number requires 4 bytes.

Table 4: Performance gain yielded by each curriculum. The results are averaged over all 19 tasks.

Table 6: Ablation results when heads or intermediate neurons are removed from similarity computation in the FULL → LIMITED regime.

Table 7: Ablation results when heads or intermediate neurons are removed from similarity computation in the LIMITED → LIMITED regime.