AdapterDistillation: Non-Destructive Task Composition with Knowledge Distillation

Leveraging knowledge from multiple tasks by introducing a small number of task-specific parameters into each transformer layer, known as adapters, has received much attention recently. However, adding an extra fusion layer to implement knowledge composition not only increases inference time but also is non-scalable for some applications. To avoid these issues, we propose a two-stage knowledge distillation algorithm called AdapterDistillation. In the first stage, we extract task-specific knowledge by using local data to train a student adapter. In the second stage, we distill the knowledge from the existing teacher adapters into the student adapter to help its inference. Extensive experiments on frequently asked question retrieval in task-oriented dialog systems validate the efficiency of AdapterDistillation. We show that AdapterDistillation outperforms existing algorithms in terms of accuracy, resource consumption and inference time.


Introduction
Task-oriented dialogue systems have recently found extensive applications in diverse business domains (Yan et al., 2017; Wei et al., 2018; Valizadeh and Parde, 2022). Owing to the idiosyncratic features of these domains, custom dialogue systems are often required. Nonetheless, the fundamental functions and architectures underlying these systems typically exhibit noteworthy similarities. Hence, the adoption of a platform-based strategy for accommodating task-oriented dialogue systems across multiple domains has emerged as a promising and effective approach.
One popular method is Multi-Task Learning (MTL), which trains multiple tasks simultaneously on a shared representation, as shown in Figure 1, and yields relatively good performance on each task (Collobert and Weston, 2008; Chen et al., 2022b,a). However, new tenants often register on the platform in a streaming manner. Predictions for the existing tenants in MTL would therefore be compromised when new tenants are added to the platform, since retraining is typically required at that time.
To ensure that tenants do not interfere with each other, an intuitive approach is to train a task-specific model for each tenant. However, this independent training approach requires a significant amount of resources to store complete model parameters, and resource consumption quickly becomes the bottleneck as many tenants register on the platform. Additionally, fine-tuning all parameters on a tenant with very little data can often lead to severe overfitting (Dietterich, 1995; Hawkins, 2004). Thus, providing tenants with the ability to solve designated tasks with limited resources is necessary.
Owing to the distinctive properties of platform-based systems, tasks implemented on the platform should satisfy the following criteria: 1) The platform witnesses a continuous influx of new tenants, so the model must ensure that the performance of the existing tenants is not degraded when new tenants are added. 2) The resources of the platform are limited, so the task performance of each tenant must be ensured with minimal storage and computational resources. Given these practical limitations, how to utilize the data of an incoming tenant together with the existing tenant models (also called teacher models) becomes an interesting topic.
In order to maintain the low resource consumption and scalability of the model while utilizing the existing tenant knowledge, we propose an algorithm called AdapterDistillation. In AdapterDistillation, we employ adapters to capture task-specific features by adding a few extra parameters to each transformer layer, and then distill knowledge from the existing teacher adapters into the current student adapter. Specifically, when a new tenant comes, we first train an adapter module on its own local data. We then load the adapter modules of all the existing teacher tenants to assist the training of this new student tenant through knowledge distillation.
Our contributions: The proposed approach has several advantages: 1) Fusion is only added during the second stage of training to guide the learning of the current student adapter and is not required during inference, ensuring structural consistency between the student adapter and the existing teacher adapters. 2) Since the adapter structure is unchanged during prediction and no additional parameters are required, the scalability and low-resource nature of the model are retained. To summarize, our contributions are: • We formulate the construction of a platform-based multi-task system as a transfer learning problem and leverage the low-resource adapter model to handle it.
• We propose an AdapterDistillation algorithm which guarantees low-resources and scalability while utilizing the existing tenant knowledge to improve performance.
• We verify noteworthy enhancements of the proposed AdapterDistillation algorithm in terms of accuracy, resource consumption and inference time, using a Frequently Asked Question (FAQ) retrieval service in task-oriented dialog systems.

Relevant Work
Recently, adapters have been proposed to capture task-specific features while achieving results similar to fine-tuning all parameters, and they have been widely applied to downstream tasks such as machine translation and cross-lingual transfer (Houlsby et al., 2019; Pfeiffer et al., 2020b; Li and Liang, 2021; Lester et al., 2021; He et al., 2022). Specifically, adapters insert two bottleneck modules into each transformer layer of the pre-trained model (Houlsby et al., 2019). During training, all parameters of the pre-trained model are frozen, and only the parameters of the newly added modules are trained. Some researchers (Pfeiffer et al., 2020a,b) have further improved the insertion position of adapters through structural search, reducing the number of adapter insertions and thus minimizing the increase in parameter count and inference time. A new type of adapter called LoRA (Hu et al., 2022) first performs low-rank decomposition on the model parameters and then inserts adapters into the key, query, and value matrices of each attention layer. This approach enhances performance and enables parallel execution of the adapter module, thus reducing inference time. Because the inter-dependencies and key success factors of various adapter methods were unclear, He et al. (2022) dissected the design of several classic adapter algorithms and established a unified framework to explore the connections between different adapter methods. Note that all of the work discussed in this paragraph is complementary to the proposed AdapterDistillation, since our algorithm is not limited to a certain type of adapter.
Thus we can combine our developed approach with all of the work discussed in this paragraph to obtain extra gains.
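For concreteness, the following is a minimal PyTorch sketch of the bottleneck adapter structure described above; the module name and the hidden/bottleneck sizes are illustrative and not taken from any of the cited implementations.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project,
    with a residual connection around the whole block. Only these few
    parameters are trained; the pre-trained transformer stays frozen."""

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: output of a transformer sub-layer, shape (batch, seq, hidden)
        return h + self.up(self.act(self.down(h)))
```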
In addition to optimizing the structure and position of adapters for each individual task, adding extra components on top of multiple adapters to maximize knowledge transfer across tasks is another effective way to improve per-task performance. For example, AdapterFusion adds an extra fusion layer to share knowledge across multiple tasks while balancing the various target tasks (Pfeiffer et al., 2021). Specifically, this method uses a two-stage training approach: first, the corresponding adapter is trained for each task; then all adapters are loaded and frozen, and an additional adapter fusion layer is trained to aggregate the outputs of all adapters, allowing the model to implicitly and automatically learn to utilize knowledge from different tasks. However, this approach faces some challenges in practical applications. Since the parameter size of the fusion layer grows linearly with the number of loaded adapters, a large number of adapters means that substantial resources are spent on fusion, defeating the purpose of using adapters in the first place. Additionally, adding a fusion layer after the adapters increases inference time, resulting in a worse user experience.
To efficiently utilize existing task knowledge and meet the platform's requirement for streaming task integration, we propose a plug-and-play AdapterDistillation algorithm. By fusing and distilling the knowledge of existing tasks into the current task during training, we keep the model structure and inference speed unchanged while achieving results comparable to AdapterFusion.

Problem Definition
Training adapters for $N$ tasks in parallel might not be practical since tenants often register on the platform in a streaming manner. Based on this fact, for an ordered collection of $N$ tasks denoted as $\{t_1, t_2, \ldots, t_N\}$, the time at which the $j$-th task is registered on the platform is assumed to be earlier than that of the $i$-th task, that is, $t_j < t_i$ when $j < i$. Throughout the paper, we adopt the following settings, which are typically true in practice: 1. The tasks considered in this paper are non-destructive, which means that when the $N$-th task is registered on the platform, the performance of the previous $(N-1)$ tasks $\{t_1, t_2, \ldots, t_{N-1}\}$ should not be impacted.
2. Since the platform often has limited resources, it is reasonable to assume every task needs to be solved with limited computing and memory resources.
3. Due to privacy and security concerns, the corpus of labeled text for the $N$-th task is only available locally.
Based on the above settings, this paper aims to maximize the transfer of knowledge from the existing tasks to the current new task without impacting the existing tasks, which matches the practical scenario where tasks are registered on the platform in a streaming manner.

AdapterDistillation
AdapterFusion allows information to be shared between different tasks through an extra fusion layer, at the cost of longer inference time and a large fusion layer (Pfeiffer et al., 2021). Moreover, when a new task is registered on the platform, the existing tasks are impacted. To mitigate this, we propose AdapterDistillation, which shares information from the existing tasks with the new one while leaving the existing tasks untouched and without increasing inference time.

Adapter Learning and Distillation Algorithm
The proposed AdapterDistillation algorithm is a two-stage learning algorithm. In the first stage, when the $N$-th new task is registered on the platform, we train an adapter model $\phi_N^{\mathrm{first}}$ for it based on its own local data.
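As an illustration of this first stage, here is a minimal training-loop sketch under assumed interfaces; the function and argument names are hypothetical, and for brevity the adapter is applied once to the backbone's final hidden states rather than inside every transformer layer as in the actual architecture.

```python
import torch

def train_first_stage(backbone, adapter, head, loader, epochs=3, lr=1e-4):
    """Stage 1: train the new tenant's adapter (and classification head) on its
    local data while the pre-trained backbone stays frozen."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.AdamW(
        list(adapter.parameters()) + list(head.parameters()), lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for batch in loader:
            h = backbone(batch["input_ids"], batch["attention_mask"])  # (batch, seq, hidden)
            logits = head(adapter(h)[:, 0])          # classify from the first token position
            loss = bce(logits.squeeze(-1), batch["labels"].float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```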
In the second stage, we employ knowledge distillation to transfer knowledge from the existing tasks to the new adapter, which means the weights of this new adapter are updated again in the second stage. To be exact, assume there are $(N-1)$ adapters already registered on the platform with weights denoted as $\{\phi_n\}_{n=1}^{N-1}$, and let the weight of the $N$-th adapter trained in the first stage be denoted as $\phi_N^{\mathrm{first}}$. With a fixed pre-trained BERT-based model $\Theta$ and the existing adapters $\{\phi_n\}_{n=1}^{N-1}$ and $\phi_N^{\mathrm{first}}$ frozen, the fusion-related parameters $\Omega$ and the $N$-th adapter weight $\phi$ are introduced to learn how to distill knowledge from the existing adapters into the $N$-th adapter so as to better solve the $N$-th task. The training process can be represented as
$$\Omega_N, \phi_N \leftarrow \operatorname*{arg\,min}_{\Omega,\,\phi}\ \mathcal{L}_{\mathrm{BCE}}\big(D_N;\Theta,\phi\big) + \eta\,\mathcal{L}_{\mathrm{distill}}\big(D_N;\Theta,\{\phi_n\}_{n=1}^{N-1},\phi_N^{\mathrm{first}},\Omega,\phi\big), \qquad (1)$$
where $D_N$ is the corpus of labeled text for task $N$, $\eta$ is a predefined constant that balances the distillation loss and the binary cross-entropy loss, $\Omega_N$ is the set of newly learned fusion-related parameters used to transfer knowledge from the existing adapters to the $N$-th adapter for task $N$, and $\phi_N$ is the final weight of adapter $N$. It is worth mentioning that, similar to AdapterFusion, each adapter in AdapterDistillation is trained twice, where the second stage is mainly aimed at implementing knowledge composition. However, different from AdapterFusion, which keeps the fusion layer at inference time, AdapterDistillation only employs the $N$-th adapter module for inference without adding an extra fusion layer (shown in Figure 2), which leads to faster inference without impacting the performance of the existing tasks. This makes sense since the $N$-th adapter weight $\phi_N$ already contains the information shared across the $N$ tasks after knowledge distillation.
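A compact sketch of how the second-stage objective in Equation (1) could be assembled is given below; the helper names and the value of $\eta$ are illustrative and not taken from a released implementation.

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def second_stage_loss(logits, labels, student_states, fusion_states, eta=0.1):
    """Sketch of Eq. (1): task BCE loss plus eta times the distillation term.
    student_states / fusion_states hold one tensor per transformer layer:
    the student adapter output and the fused teacher output, respectively.
    Both the fusion parameters and the student adapter receive gradients,
    matching the joint optimization described in the text."""
    task_loss = bce(logits, labels.float())
    distill_loss = sum(mse(s, f) for s, f in zip(student_states, fusion_states))
    return task_loss + eta * distill_loss
```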

Detailed Components
During the training process, AdapterDistillation learns to distill the knowledge of the existing $(N-1)$ adapters into the $N$-th adapter by introducing the fusion weights $\Omega$ and updating the $N$-th adapter weights $\phi$. The fusion weights $\Omega$ transfer the existing knowledge to the $N$-th adapter module by dynamically introducing the distillation loss term shown in (1). This pushes the $N$-th adapter to learn not only from its own task data $D_N$ but also from the intermediate representations of the previous $(N-1)$ adapters.
As shown in Figure 2, our AdapterDistillation architecture contains three components: an adapter fusion, $N$ teacher adapters and the $N$-th student adapter. In the student adapter, the output of the feed-forward sublayer of layer $l$ at iteration $t$, denoted as $h_{l,t}$, is fed into the $N$-th adapter to obtain the adapter output $z_{l,t,N} = g(h_{l,t}, \phi)$, where $g(\cdot,\cdot)$ is the nonlinear adapter transformation and $\phi$ are the adapter parameters to be optimized. Interestingly, among the $N$ teacher adapters, we not only use the previous $(N-1)$ fully trained adapters $\phi_n$ as teachers to enable sharing of information between different tasks, but also add the $N$-th partially trained adapter $\phi_N^{\mathrm{first}}$ obtained in the first stage as a teacher to insert task-specific knowledge. In other words, the outputs of the $N$ teacher adapters are $z_{l,t,n} = g(h_{l,t}, \phi_n)$ for $n = 1, 2, \ldots, N-1$ and $z_{l,t,N} = g(h_{l,t}, \phi_N^{\mathrm{first}})$. Similar to AdapterFusion (Pfeiffer et al., 2021), AdapterDistillation dynamically combines the different adapters by introducing a Query $Q_l$, Key $K_l$, and Value $V_l$ at each transformer layer $l$, with the complete set being $\Omega = \{Q_l, K_l, V_l\}_{l=1}^{L}$. We use $z_{l,t,n}$ for $n = 1, 2, \ldots, N$ as input to the value and key transformations to obtain $z^{v}_{l,t,n} = z_{l,t,n}^{\top} V_l$ and $z^{k}_{l,t,n} = z_{l,t,n}^{\top} K_l$, respectively. The output of the feed-forward sublayer $h_{l,t}$ is used as input to the query transformation to obtain $h^{Q}_{l,t} = h_{l,t}^{\top} Q_l$. The output of the adapter fusion $o_{l,t}$ is then obtained as
$$p_{l,t,n} = \frac{\exp\!\big(h^{Q}_{l,t} \cdot z^{k}_{l,t,n}\big)}{\sum_{m=1}^{N}\exp\!\big(h^{Q}_{l,t} \cdot z^{k}_{l,t,m}\big)}, \qquad o_{l,t} = \sum_{n=1}^{N} p_{l,t,n}\, z^{v}_{l,t,n}, \qquad (2)$$
where $p_{l,t}$ learns to weight the adapters with regard to the context. Finally, the distillation loss described in (1) is defined as
$$\mathcal{L}_{\mathrm{distill}} = \sum_{l=1}^{L}\sum_{t} \big\| o_{l,t} - g(h_{l,t}, \phi) \big\|_2^2, \qquad (3)$$
with $g(h_{l,t}, \phi)$ being the student adapter output. It is worth mentioning that in the second stage we jointly optimize the adapter fusion parameters $\Omega$ and $\phi$ so as to obtain the optimal $\phi_N$, which contains the most useful mixed knowledge from the available adapters.
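To make the computation in Equations (2) and (3) concrete, the following is a minimal PyTorch sketch of the per-layer fusion and the distillation term. The class name, tensor shapes, and the mean-squared-error form of the distillation term follow our reconstruction above and are illustrative rather than a released implementation.

```python
import torch
import torch.nn as nn

class AdapterFusionDistill(nn.Module):
    """Per-layer fusion used only during second-stage training.
    Attends over the N frozen teacher-adapter outputs (Eq. (2)) and returns
    the fused representation o together with the distillation term (Eq. (3))
    against the trainable student adapter output."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size, bias=False)  # Q_l
        self.key = nn.Linear(hidden_size, hidden_size, bias=False)    # K_l
        self.value = nn.Linear(hidden_size, hidden_size, bias=False)  # V_l

    def forward(self, h, teacher_outputs, z_student):
        # h:               (batch, seq, hidden)     feed-forward sub-layer output h_{l,t}
        # teacher_outputs: (batch, seq, N, hidden)  stacked frozen teacher adapter outputs
        # z_student:       (batch, seq, hidden)     student adapter output g(h, phi)
        q = self.query(h).unsqueeze(2)            # (batch, seq, 1, hidden)
        k = self.key(teacher_outputs)             # (batch, seq, N, hidden)
        v = self.value(teacher_outputs)           # (batch, seq, N, hidden)
        p = torch.softmax((q * k).sum(-1), dim=-1)    # attention over the N teachers
        o = (p.unsqueeze(-1) * v).sum(dim=2)          # fused output o_{l,t}, Eq. (2)
        distill = ((o - z_student) ** 2).mean()       # Eq. (3), mean rather than sum
        return o, distill
```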
During the inference stage, we only employ $\phi_N$ to make predictions for the $N$-th task, without the adapter fusion part, so as to reduce inference time. As shown next, employing only $\phi_N$ also achieves performance comparable to AdapterFusion.

Experiments
To validate the effectiveness of AdapterDistillation in terms of accuracy, resource consumption and inference time, we evaluate its performance in a practical scenario: Frequently Asked Question (FAQ) retrieval in task-oriented dialog systems.

Experimental Setting

Datasets and Metrics
We select 9 existing tenant models from the platform as teacher adapters, covering fields such as medical care, transportation, insurance, shopping, photography and leasing, and employ the performance of the 10-th tenant (the student adapter) to evaluate the considered models. In order to reduce variance, we independently choose 8 unregistered tenants from different business domains as the 10-th student tenant. The 8 independent student tenants are from international payments, merchant payments, broadband installation, cross-border payments, merchant signing, human resources, administrative management, and IT support. The data size for each student tenant ranges from 1,000 to 5,000 examples, split into training, validation, and test sets with a ratio of 8:1:1. We not only use accuracy and AUC to evaluate performance, but also report resource consumption and inference time as additional metrics, since they indicate the suitability of the models for online practical applications.

In addition to accuracy and AUC, resource consumption is an important indicator of deployment cost. In terms of the storage required during inference, the pre-trained Bert-base-Chinese model takes up approximately 391MB. The fusion module and the adapter module occupy 82MB and 3.5MB, respectively, and the final classification layer requires 2.3MB. The results in Table 2 indicate that AdapterDistillation has a significant advantage in resource consumption over the other methods, which becomes more pronounced as the available storage grows.

For online applications, inference time is closely related to the actual user experience. We therefore compare the inference time of the three algorithms under different batch sizes. The results in Table 3 show that AdapterDistillation has the same inference time as the Adapter method and is significantly faster than AdapterFusion. This is expected because the extra fusion layer in AdapterFusion takes additional inference time. It is worth noting that the inference time of the Adapter/AdapterDistillation methods is slightly larger (about 2.5%) than full fine-tuning, which is caused by the serial insertion of the adapter module. Note that AdapterDistillation is independent of the adapter structure itself and can be hot-swapped into any adapter-like method, such as LoRA (Hu et al., 2022), to match the inference time of full fine-tuning through parallel insertion.
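Given the storage figures quoted above, the per-tenant cost can be checked with a rough back-of-the-envelope calculation. The numbers come from the text; the formula is our own simplification and treats the fusion layer size as fixed, whereas it actually grows with the number of loaded adapters.

```python
# Rough storage arithmetic based on the sizes quoted above (in MB).
BASE = 391.0      # Bert-base-Chinese backbone, stored once and shared
FUSION = 82.0     # fusion module (AdapterFusion only)
ADAPTER = 3.5     # one adapter module
HEAD = 2.3        # one classification layer

def max_tenants(total_mb: float, method: str) -> int:
    """How many tenants fit into total_mb of storage under each method."""
    if method == "full_finetune":          # every tenant stores its own full model
        shared, per_tenant = 0.0, BASE + HEAD
    elif method == "adapterfusion":        # shared backbone; adapter, fusion and head per tenant
        shared, per_tenant = BASE, ADAPTER + FUSION + HEAD
    else:                                  # "adapter" or "adapterdistillation"
        shared, per_tenant = BASE, ADAPTER + HEAD
    return max(0, int((total_mb - shared) // per_tenant))

print(max_tenants(10_000, "adapterdistillation"))  # roughly 1656 tenants in 10 GB
print(max_tenants(10_000, "full_finetune"))        # roughly 25 tenants in 10 GB
```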

To verify that the improvement in model performance is due to the sharing of information between different tasks, we remove the current tenant from the teacher adapters and train using only the existing tenants as teachers. Figure 3 shows that, compared to including the current tenant among the teachers, the average accuracy drops only slightly (by about 0.11%) and still outperforms the Adapter baseline by about 1.18%. This indicates that AdapterDistillation can effectively use multiple sources of extracted information.

Conclusions and Future Work
We proposed a novel, plug-and-play multi-adapter knowledge distillation algorithm called AdapterDistillation to implement the sharing of information between different tasks. Specifically, our proposed algorithm consisted of two stages of training. In the first stage, task-specific knowledge was extracted by training a student adapter using local data. In the second stage, knowledge from the existing teacher adapters was distilled into this student adapter by optimizing the distillation loss. Note that AdapterDistillation only employed the trained student adapter for inference, which resulted in fast inference and low resource consumption. Our proposed AdapterDistillation algorithm outperformed existing algorithms in terms of accuracy, resource consumption and inference time, meeting the practical scenario where numerous tenants access the platform in a streaming manner. In the future, plugging more advanced adapter structures into AdapterDistillation is an interesting direction to explore.

Figure 2: Our AdapterDistillation architecture. This includes the trainable weights Query $Q_l$, Key $K_l$, Value $V_l$ and the $N$-th adapter weight $\phi_N$ at each transformer layer $l$.

Figure 3: Accuracy of the 10-th tenant from 8 different domains for the different architectural setups. AdapterDistill* is the algorithm where we remove the current tenant as a teacher adapter in the second stage.

Table 1: The accuracy and AUC for the 10-th tenant with different architectural setups. The result in parentheses is AUC. Added Params Per Task represents the percentage of additional parameters added for each task.

Table 2: The number of tenants that can be supported by different methods versus the storage space.

Table 3: Inference time of a single forward pass, measured in milliseconds and averaged over 100 runs.