Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression

Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks. However, the massive size of these models poses huge challenges for their deployment in real-world applications. While numerous model compression techniques have been proposed, most of them are not well-suited for achieving extreme model compression when there is a significant gap in model scale. In this paper, we introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT), which effectively transfers the knowledge of LLMs to extremely small-scale models (e.g., 1% of the original model size). In particular, our approach extracts knowledge from LLMs to construct a knowledge store, from which the small-scale model can retrieve relevant information and leverage it for effective inference. To improve the quality of the model, soft prompt tuning and Proximal Policy Optimization (PPO) reinforcement learning techniques are employed. Extensive experiments are conducted on low-resource tasks from SuperGLUE and GLUE benchmarks. The results demonstrate that the proposed approach significantly enhances the performance of small-scale models by leveraging the knowledge from LLMs.


Introduction
Pre-trained language models (PLMs), such as BERT/RoBERTa (Devlin et al., 2019; Liu et al., 2019), have demonstrated exceptional performance across various natural language processing (NLP) applications. However, these models typically comprise hundreds of millions of parameters, presenting a substantial challenge for researchers due to their massive scale. As a result, the full potential of large-scale PLMs remains untapped. To tackle this challenge, a multitude of model compression techniques have been proposed, encompassing knowledge distillation (Sanh et al., 2019; Jiao et al., 2020; Passban et al., 2021), network pruning (Liang et al., 2021a; Gordon et al., 2020), quantization (Zhang et al., 2020; Tao et al., 2022), and weight sharing (Lan et al., 2020).
However, these compression methods are not directly applicable to scenarios requiring extremely high compression ratios; for knowledge distillation in particular, the introduction of assistant models (Mirzadeh et al., 2020; Son et al., 2021) often leads to degraded and unstable performance. Recently, there has been growing interest in leveraging large language models (LLMs) (Touvron et al., 2023; Zeng et al., 2022; Ouyang et al., 2022; Scao et al., 2022) that possess extensive language knowledge and can be effectively employed in various downstream tasks. Consequently, it is essential to explore methods for transferring this knowledge to small-scale models. Nonetheless, existing approaches are inadequate for compressing LLMs because of the exceptionally high compression ratios required. Some prior research (Wang et al., 2022; Dai et al., 2023; Ubani et al., 2023) has suggested utilizing LLMs for data augmentation and knowledge transfer to small-scale models, which allows the latter to demonstrate improved performance on low-resource datasets. However, when tackling more challenging tasks such as the SuperGLUE benchmark (Wang et al., 2019a), the limited parameter size of small-scale models becomes a hindrance, preventing them from effectively retaining the knowledge transferred from LLMs. Consequently, the performance gains achieved for small-scale models remain limited.
To effectively transfer the knowledge of Large Language Models (LLMs) to small-scale models, enabling them to efficiently and accurately complete tasks, we propose a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT). Our approach involves two main steps: knowledge extraction from the LLM to construct a knowledge store, and subsequent retrieval of relevant information from the knowledge store by the small-scale model to accomplish the task. More specifically, we employ the technique of soft prompt tuning to fine-tune an LLM, ensuring that it generates in-domain samples. Additionally, we introduce the Proximal Policy Optimization (PPO) (Schulman et al., 2017) reinforcement learning algorithm to enhance the generation quality. As a final step, the small-scale model learns to retrieve pertinent information from the knowledge store. We perform an extensive set of experiments on truly low-resource and challenging tasks sourced from the SuperGLUE (Wang et al., 2019a) and GLUE (Wang et al., 2019b) benchmarks. The experimental results demonstrate that RetriKT significantly enhances the performance of small-scale models and outperforms previous SOTA knowledge distillation methods by leveraging the knowledge of LLMs. This indicates the effectiveness and practicality of the retrieval-based knowledge transfer paradigm for extreme model compression.
Our contributions can be summarized as follows:
• We propose a new compression paradigm called Retrieval-based Knowledge Transfer, which aims to transfer knowledge from LLMs to extremely small-scale models. This paradigm addresses the challenge of achieving extreme model compression when there is a significant gap in model scale.
• We introduce the reinforcement learning algorithm PPO to enhance the generation quality, and carefully design the reward function. This technique contributes to improving the diversity and accuracy of the knowledge extracted from LLMs for knowledge transfer.
• We conduct extensive experiments on low-resource tasks from the SuperGLUE and GLUE benchmarks. The results demonstrate that RetriKT significantly enhances the performance of small-scale models and outperforms previous SOTA knowledge distillation methods by leveraging the knowledge from LLMs.

Related Work
This paper involves three areas of work: knowledge distillation, reinforcement learning, and data augmentation.

Knowledge Distillation
We first introduce related work on knowledge distillation (KD), which can be broadly categorized into response-based, feature-based, and relation-based methods. Feature-based methods include matching the hidden representations of selected layers (Sun et al., 2019), matching hidden representations of all tokens (Jiao et al., 2020), and introducing customized feature-based distillation (Sun et al., 2020).
On the other hand, relation-based methods (Park et al., 2021; Liu et al., 2022a; Jiao et al., 2020; Wang et al., 2020, 2021a; Wu et al., 2023) aim to emphasize the importance of capturing and utilizing relations among multi-granularity representations in both horizontal and vertical directions.

Reinforcement Learning
Recently, reinforcement learning (RL) has gained significant attention in the field of language modeling. This approach has been successfully applied to various language tasks, including summarization (Paulus et al., 2018; Ziegler et al., 2019; Stiennon et al., 2020; Wu et al., 2021), dialogue systems (Zhou et al., 2017; Jaques et al., 2020; Hancock et al., 2019), machine translation (Bahdanau et al., 2017; Kreutzer et al., 2018; Kiegeland and Kreutzer, 2021), semantic parsing (Lawrence and Riezler, 2018), story generation (Zhou and Xu, 2020), and question generation (Pang and He, 2021). Concurrently, there has been growing interest in using RL to align language models with human preferences across a wide range of language tasks. For example, Ouyang et al. (2022) employed RL techniques, specifically Proximal Policy Optimization (PPO) (Schulman et al., 2017), to fine-tune a large language model and align it with human preferences.

Data Augmentation

Wu et al. (2019) and Kumar et al. (2020) introduced methods to create synthetic data by randomly masking words in the original training instances. Other works (Ding et al., 2020; Yang et al., 2020; Anaby-Tavor et al., 2020) utilized language models (LMs) and pre-trained language models (PLMs) to directly generate synthetic data for NLU tasks. Wang et al. (2021b, 2022) proposed the use of hard prompts and soft prompts to generate synthetic data. In related studies, Liu et al. (2021b) and Zan et al. (2022) have uncovered the complementary nature of PLMs and classical approaches.

Methodology
Our approach revolves around extracting knowledge from the parameters of a Large Language Model (LLM) and leveraging it for the benefit of an extremely small-scale model through retrieval. We introduce a technique to extract in-domain knowledge from the LLM for each task by freezing its parameters and fine-tuning additional soft prompts using the corresponding dataset (Section 3.1). To further enhance the generation quality, we introduce reinforcement learning and carefully design the reward function to fine-tune the soft prompts (Section 3.2). Subsequently, we extract the in-domain knowledge from the LLM and create a knowledge store (Section 3.3). Finally, we enable the small-scale model to retrieve and effectively utilize the knowledge generated by the LLM to successfully perform specific tasks (Section 3.4).

Soft Prompt Tuning
To enhance the extraction of in-domain knowledge from the LLM, it is crucial to introduce appropriate prompts that precede the inputs, effectively guiding the text generation process (e.g., "Generate rte examples according to the following keywords:"). However, relying solely on manual design often leads to limited diversity and effectiveness in the generated knowledge. To overcome such limitations, we employ trainable parameters known as soft prompts, thereby replacing manual templates. We only need to update these soft prompts and keep all other parameters of the LLM fixed during training for each specific task.
More specifically, we utilize P-tuning v2 (Liu et al., 2022b), which incorporates a soft prompt into each Transformer (Vaswani et al., 2017) layer. The soft prompt at the j-th layer is represented as a sequence of trainable vectors, denoted as $P^j = \{p^j_1, \dots, p^j_k\}$, where k is a hyperparameter representing the prompt length. The i-th hidden state at the j-th layer, denoted as $h^j_i$, can be expressed as

$$h^j_i = \mathrm{Trans}\big([P^j; H^{j-1}]\big)_i, \qquad H^0 = \{w_i\},$$

where Trans(·) represents the Transformer layer, $H^j$ represents all hidden states at the j-th layer, and $w_i$ is the word embedding of the input text.
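For concreteness, the following PyTorch sketch illustrates how layer-wise soft prompts of the kind used in P-tuning v2 can be prepended to the hidden states of every Transformer layer while the backbone stays frozen. The module and parameter names (SoftPromptedEncoder, prompt_len) are illustrative and do not come from the released implementation; only the prompt parameters receive gradients, mirroring the setup described above.

```python
import torch
import torch.nn as nn

class SoftPromptedEncoder(nn.Module):
    """Illustrative sketch: prepend k trainable prompt vectors at every layer."""

    def __init__(self, num_layers=2, d_model=64, nhead=4, prompt_len=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # One soft prompt P^j per layer, each a sequence of k trainable vectors.
        self.prompts = nn.ParameterList(
            nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
            for _ in range(num_layers)
        )
        self.prompt_len = prompt_len

    def forward(self, word_embeddings):                       # (B, L, d_model)
        h = word_embeddings
        for layer, prompt in zip(self.layers, self.prompts):
            batch_prompt = prompt.unsqueeze(0).expand(h.size(0), -1, -1)
            h = layer(torch.cat([batch_prompt, h], dim=1))     # Trans([P^j; H^{j-1}])
            h = h[:, self.prompt_len:, :]                      # keep only text positions
        return h

# Only the soft prompts are trained; the backbone stays frozen.
model = SoftPromptedEncoder()
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("prompts")
out = model(torch.randn(2, 10, 64))    # (2, 10, 64)
```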
In order to generate diverse samples, we adopt the approach presented in PromDA (Wang et al., 2022), which generates samples from both the Input View and the Output View. The Input View is conditioned on the keywords present in the input texts, which are extracted by the unsupervised keyword extraction algorithm Rake (Rose et al., 2010), while the Output View is conditioned on the labels. Accordingly, we train two sets of soft prompts, P input and P output, which are specifically designed to generate samples from the Input View and the Output View, respectively. Each sample $S = (Y, X)$ generated by the LLM consists of two parts: the label $Y = \{y_i\}_{i=1}^{l_y}$ and the input text $X = \{x_i\}_{i=1}^{l_x}$. The supervised learning objectives for the soft prompts P input and P output can be formalized as

$$\mathcal{L}_{\text{input}} = -\log P\big(S \mid K; P_{\text{input}}\big), \qquad \mathcal{L}_{\text{output}} = -\log P\big(S \mid Y; P_{\text{output}}\big),$$

where K and Y represent the keywords of the input text and the label, respectively.
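As a rough illustration, the supervised objective above can be computed as the negative log-likelihood of the generated sample under a sequence-to-sequence LM conditioned on the keywords (Input View) or on the label (Output View). The sketch below uses a small public T5 checkpoint and an invented "keywords: ... / label: ..." condition format purely for demonstration; in the actual setup the backbone is T5-XL and only the soft prompts would be updated.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")             # stand-in for the 3B backbone
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def supervised_prompt_loss(condition_text, sample_text):
    """Negative log-likelihood of a generated sample S = (Y, X) given its condition.

    For the Input View the condition is the keywords K of the input text;
    for the Output View it is the label Y. Here the whole model is trained
    for brevity; in the paper's setup only the soft prompts are tuned.
    """
    enc = tok(condition_text, return_tensors="pt")
    labels = tok(sample_text, return_tensors="pt").input_ids
    return model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels).loss

# Input View: condition on keywords; Output View: condition on the label.
l_input = supervised_prompt_loss("keywords: doctor, hospital",
                                 "entailment | The doctor works at the hospital.")
l_output = supervised_prompt_loss("label: entailment",
                                  "entailment | The doctor works at the hospital.")
print(l_input.item(), l_output.item())
```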

Reinforcement Learning
Constructing a knowledge store that combines accuracy and diversity is essential for enabling the small-scale model to effectively retrieve relevant information. However, traditional supervised learning methods are unable to optimize these learning objectives due to the lack of per-token differentiability. Therefore, we employ reinforcement learning to improve the generation quality. This approach involves guaranteeing that the generated texts align with the labels, thereby ensuring the accuracy of the knowledge store. Additionally, it ensures that the generated samples exhibit diversity, thus fulfilling the diversity requirement of the knowledge store.
More specifically, we utilize the original dataset to train a classification model, which serves as the reward model, denoted as RM(·). For each generated sample, comprising both input text and label, we assess the confidence score that the reward model assigns to the sample's label, thereby obtaining the accuracy reward $R_{\text{accuracy}}$. Additionally, we evaluate the diversity of the generated samples by employing the Self-BLEU metric (Zhu et al., 2018), which calculates the BLEU score (Papineni et al., 2002) treating the other samples as references. The diversity reward is then computed as $R_{\text{diversity}} = 1 - b_3$, where $b_3$ denotes the BLEU-3 score. To avoid the generation of overly simplistic patterns by the LLM, we apply a length penalty: if the length l of a generated sample falls below a threshold $l_{\min}$, the reward is multiplied by a discount factor. The final reward for each generated sample combines the accuracy and diversity rewards, where a hyperparameter α balances their relative weights.

Following previous studies (Ouyang et al., 2022; Bai et al., 2022), we employ Proximal Policy Optimization (PPO) (Schulman et al., 2017) as the reinforcement learning algorithm to fine-tune the soft prompts P input and P output. To ensure reliable performance in generating in-domain samples, we incorporate supervised gradients into the PPO gradients. Our objective is to minimize the following loss function:

$$\mathcal{L} = \mathcal{L}_p + \mathcal{L}_v + \beta \mathcal{L}_{\text{sft}},$$

where $\mathcal{L}_p$ and $\mathcal{L}_v$ are the actor and critic losses of the PPO algorithm, respectively, $\mathcal{L}_{\text{sft}}$ denotes $\mathcal{L}_{\text{input}}$ or $\mathcal{L}_{\text{output}}$ for P input and P output respectively, and β is a hyperparameter controlling the strength of the supervised gradients.
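A minimal sketch of the reward computation is shown below: the accuracy reward is the reward model's confidence in the sample's own label, the diversity reward is one minus the sample's BLEU-3 against the other generated samples, and short samples are discounted. The exact combination weight and the exact discount factor are assumptions, since the text above only states that α balances the two rewards and that a length penalty is applied; in the full pipeline this reward would be fed to PPO (e.g., via the trl library) to update the soft prompts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def sample_reward(text_tokens, other_samples, label_confidence,
                  alpha=0.2, l_min=5):
    """Sketch of the reward for one generated sample.

    label_confidence : reward model's probability for the sample's own label
                       (the accuracy reward R_accuracy).
    other_samples    : token lists of the other generated samples, used as
                       references for Self-BLEU.
    """
    # Diversity reward: 1 - BLEU-3 of this sample against the other samples.
    b3 = sentence_bleu(other_samples, text_tokens,
                       weights=(1 / 3, 1 / 3, 1 / 3),
                       smoothing_function=SmoothingFunction().method1)
    r_diversity = 1.0 - b3

    reward = label_confidence + alpha * r_diversity      # assumed combination
    if len(text_tokens) < l_min:                         # length penalty
        reward *= len(text_tokens) / l_min               # assumed discount factor
    return reward

sample = "the doctor works at the hospital".split()
others = ["a nurse works in a clinic".split(),
          "the doctor is at home today".split()]
print(sample_reward(sample, others, label_confidence=0.91))
```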

Knowledge Store
We employ the fine-tuned LLM to generate samples and construct a knowledge store for each task. Specifically, for each sample in the original dataset D, we use the keywords extracted by the Rake algorithm and the labels as inputs to the model. Each sample is used m times, and in each iteration n samples are generated. To ensure distinctiveness among the generated samples, we adopt top-p (nucleus) sampling (Holtzman et al., 2020), which constrains tokens to the smallest possible set whose cumulative probability exceeds the probability parameter p. The set of samples generated from keywords is denoted as $D_I$, while the set generated from labels is denoted as $D_O$. The quantity of both $D_I$ and $D_O$ is $mn|D|$, where $|D|$ represents the size of the original dataset D. To further enhance the diversity of generated samples, we use the label of each sample in $D_I$ as the model input to generate one sample each time, resulting in the sample set $D_{IO}$; similarly, we use the keywords of each sample in $D_O$ as the model input, resulting in the sample set $D_{OI}$. The final knowledge store is composed of these sample sets together with the original data, namely $D \cup D_I \cup D_O \cup D_{IO} \cup D_{OI}$, which theoretically yields a total of $(1 + 4mn)|D|$ samples. Since duplicates may occur among the generated samples, we remove them.

As illustrated in Figure 1, the key for each sample in the knowledge store is the sentence embedding produced by the small-scale model. These embeddings enable the model to retrieve relevant information from the knowledge store. The value of each sample is the probability distribution produced by the reward model RM(·) for each task.
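The sketch below illustrates the structure of the knowledge store described above: each deduplicated generated sample is stored with the small-scale model's sentence embedding as the key and the reward model's class distribution as the value. The encoder and reward-model callables here are toy stand-ins, not the actual models.

```python
import torch

def build_knowledge_store(generated_texts, small_encoder, reward_model):
    """Sketch: key = small model's sentence embedding, value = reward model's
    probability distribution over classes, with duplicate generations dropped.

    small_encoder / reward_model are stand-in callables returning a sentence
    embedding and class logits respectively; the names are illustrative.
    """
    store_keys, store_values, seen = [], [], set()
    for text in generated_texts:
        if text in seen:                                  # remove duplicates
            continue
        seen.add(text)
        with torch.no_grad():
            key = small_encoder(text)                     # (d,) embedding
            value = reward_model(text).softmax(dim=-1)    # (num_classes,)
        store_keys.append(key)
        store_values.append(value)
    return torch.stack(store_keys), torch.stack(store_values)

# Toy stand-ins so the sketch runs end to end.
emb = lambda t: torch.randn(16)
rm = lambda t: torch.randn(3)
keys, values = build_knowledge_store(["a", "b", "a", "c"], emb, rm)
print(keys.shape, values.shape)    # torch.Size([3, 16]) torch.Size([3, 3])
```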

Retrieval-based Knowledge Transfer
Due to the limited number of parameters in the small-scale model, it is not feasible to directly memorize all the knowledge extracted from the LLM. Therefore, instead of training the small-scale model directly on the generated samples, we leverage the knowledge in a retrieval manner. Specifically, we utilize the reward model RM(·) to provide the sentence representation $f_T(x_i)$ for each generated sample $x_i$, which is used to calculate cosine similarity scores with the other generated samples in the mini-batch. The similarity score list obtained from the reward model is denoted as $S^T_i = \{\phi(f_T(x_i), f_T(x_j))\}_{j \in [1,N] \wedge j \neq i}$, where N represents the batch size and $\phi$ denotes cosine similarity. Similarly, we obtain the similarity score list $S^S_i$ from the small-scale model, whose sentence representation we denote $f_S$. These similarity scores are then normalized in a listwise way to obtain the relevance distributions

$$s^T(x_i, x_j) = \frac{\exp\big(\phi(f_T(x_i), f_T(x_j)) / \tau_1\big)}{\sum_{j' \neq i} \exp\big(\phi(f_T(x_i), f_T(x_{j'})) / \tau_1\big)}, \qquad s^S(x_i, x_j) = \frac{\exp\big(\phi(f_S(x_i), f_S(x_j)) / \tau_2\big)}{\sum_{j' \neq i} \exp\big(\phi(f_S(x_i), f_S(x_{j'})) / \tau_2\big)},$$

where $\tau_1$ and $\tau_2$ are temperature hyperparameters that smooth the distributions. Our objective is to enable the small-scale model to learn how to retrieve relevant information from the knowledge store. To achieve this, we minimize the KL divergence between the two relevance distributions $S^T_i = \{s^T(x_i, x_j)\}_{j \in [1,N] \wedge j \neq i}$ and $S^S_i = \{s^S(x_i, x_j)\}_{j \in [1,N] \wedge j \neq i}$ as the learning objective for the small-scale model:

$$\mathcal{L}_{\text{retrieval}} = \sum_i \mathrm{KL}\big(S^T_i \,\|\, S^S_i\big).$$

During inference of the small-scale model, we retrieve the k most relevant samples from the knowledge store for each test sample x, denoted as $\{x_i\}_{i=1}^{k}$. To determine the weight of each retrieved sample $x_i$, we calculate the similarity score between the test sample x and the retrieved sample $x_i$. The final prediction logit score for each class of the small-scale model is obtained as

$$\mathrm{logit}_c = \sum_{i=1}^{k} \phi\big(f_S(x), f_S(x_i)\big) \, p^T_c(x_i),$$

where c is a specific class and $p^T_c$ is the confidence score of the label c predicted by the reward model RM(·). The final prediction of the small-scale model is the class with the largest logit score.
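To make the retrieval objective and the inference procedure concrete, the following PyTorch sketch computes the in-batch relevance distributions with temperatures τ1 and τ2, minimizes their KL divergence, and at test time aggregates the reward-model distributions of the top-k retrieved samples weighted by cosine similarity. The function names and the unnormalized weighting in retrieval_predict are assumptions consistent with, but not copied from, the description above.

```python
import torch
import torch.nn.functional as F

def retrieval_kl_loss(teacher_emb, student_emb, tau_1=0.2, tau_2=0.1):
    """In-batch retrieval objective: match the student's relevance distribution
    over the other samples in the batch to the reward model's (teacher's).

    teacher_emb, student_emb : (N, d_t) and (N, d_s) embeddings of one batch.
    """
    n = teacher_emb.size(0)
    sim_t = F.cosine_similarity(teacher_emb.unsqueeze(1), teacher_emb.unsqueeze(0), dim=-1)
    sim_s = F.cosine_similarity(student_emb.unsqueeze(1), student_emb.unsqueeze(0), dim=-1)
    off_diag = ~torch.eye(n, dtype=torch.bool)
    sim_t = sim_t[off_diag].view(n, n - 1)           # drop the j == i entries
    sim_s = sim_s[off_diag].view(n, n - 1)
    p_t = F.softmax(sim_t / tau_1, dim=-1)           # listwise normalisation (teacher)
    log_p_s = F.log_softmax(sim_s / tau_2, dim=-1)   # listwise normalisation (student)
    return F.kl_div(log_p_s, p_t, reduction="batchmean")

def retrieval_predict(test_emb, store_keys, store_values, k=4):
    """Inference: retrieve the k most similar stored samples and sum their
    reward-model class distributions weighted by cosine similarity."""
    sims = F.cosine_similarity(test_emb.unsqueeze(0), store_keys, dim=-1)   # (M,)
    top = sims.topk(min(k, store_keys.size(0)))
    logits = (top.values.unsqueeze(-1) * store_values[top.indices]).sum(dim=0)
    return logits.argmax().item()

loss = retrieval_kl_loss(torch.randn(8, 16), torch.randn(8, 32))
pred = retrieval_predict(torch.randn(32), torch.randn(100, 32),
                         torch.randn(100, 3).softmax(dim=-1))
print(loss.item(), pred)
```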
Experimental Setup

Baselines
We compare our method with Vanilla KD (Hinton et al., 2015) and recent strong KD methods including MGSKD (Liu et al., 2022a) and AD-KD (Wu et al., 2023), which are re-implemented based on the released code provided by AD-KD. We fine-tune EncT5 xl (Liu et al., 2021a) as the teacher model and train BERT base using the Vanilla KD method as the teacher assistant (TA) model for all baselines. We carry out a grid search over learning rate ∈ {1e-5, 2e-5, 3e-5, 4e-5}, with batch size ∈ {4, 8, 16} for the COPA and CB datasets and batch size ∈ {16, 32, 64} for the other datasets. The number of training epochs is set to 40 for CB and COPA, 20 for RTE, and 10 for the other datasets. We report results on the validation set obtained from the best checkpoint during training.

Implementation
We implement all experiments with the deep learning framework PyTorch, building on PromDA (Wang et al., 2022) and the trl library (von Werra et al., 2020), and run them on up to eight NVIDIA Tesla A100 GPUs (80GB memory). We build the LLM on T5 xl with 3B parameters (Raffel et al., 2020) and convert each dataset into a single-sentence format according to the template provided by T5; a preprocessed example for each dataset is shown in Appendix B. We utilize two small-scale BERT models released by Turc et al. (2019): one with 4 Transformer layers, 512 hidden neurons, and 8 attention heads, and the other with 2 Transformer layers, 512 hidden neurons, and 8 attention heads.

First, we tune the soft prompts P input and P output by supervised learning to obtain the reference and initial models for the PPO algorithm. The prompt length is set to 8 for all datasets. The learning rate and batch size are set to 1e-3 and 64, respectively. The number of training steps is set to 10K for COPA and CB, 20K for RTE, and 40K for the other datasets. We fine-tune EncT5 xl as the reward model for each dataset, which is also the teacher model for all baselines for a fair comparison. Then we train the soft prompts through a combination of supervised learning and reinforcement learning with the loss function defined in Eq. (6). We set α and β to 0.2 and 1, respectively. The generation number n and the probability p for top-p sampling are always set to 5 and 0.9, respectively, while the sample times m is set to 64 for COPA and CB, 16 for RTE, and 8 for the other datasets.

Finally, we train the small-scale BERT models with a grid search over learning rate ∈ {2e-5, 3e-5} and temperature combinations (τ1, τ2) ∈ {(0.2, 0.1), (0.1, 0.05)}. The batch size is set to 128. There are two versions of RetriKT: one is trained by Vanilla KD on the generated samples (RetriKT-KD), and the other is trained with the retrieval learning objective (RetriKT-Retrieval). More training details for the PPO algorithm can be found in Appendix A.

Main Results
As presented in Table 2, both RetriKT-KD and RetriKT-Retrieval outperform previous KD methods across all backbones, which demonstrates the effectiveness of our approach. For instance, compared to the previous state-of-the-art method AD-KD, RetriKT-Retrieval achieves significant improvements: 7.34% on BERT 2 and 9.28% on BERT 4. Notably, RetriKT-Retrieval consistently outperforms RetriKT-KD, especially on BERT 2, indicating that retrieving relevant information from the knowledge store is more suitable for small-scale models than attempting to memorize all the knowledge. Furthermore, we observe comparable performance between Vanilla KD and Vanilla KD (w. TA), suggesting limitations of using the TA model for knowledge transfer. A more effective approach for transferring knowledge from the LLM to the small-scale model is to prompt the LLM to generate in-domain knowledge, which the small-scale model can then retrieve as relevant information for effective inference.

Ablation Study
To investigate the impact of the different components of reinforcement learning in our approach, we conduct ablation studies on the CB, COPA, and CoLA datasets using BERT 2 and BERT 4 as the small-scale models; the results are reported in Table 3.

Detailed Analysis
Effect of the Sample Times
We conduct experiments on the COPA dataset to investigate the impact of the sample times m on model performance. As shown in Figure 2, with a small value of m, the performance of RetriKT-KD and RetriKT-Retrieval is comparable. However, as m increases, we observe a significant improvement in the performance of RetriKT-Retrieval over RetriKT-KD. This finding suggests that small-scale models, constrained by their limited parameters, are unable to effectively memorize such a large amount of knowledge; retrieval-based knowledge transfer therefore proves to be the more favorable approach. Furthermore, as m continues to increase, the performance of both RetriKT-KD and RetriKT-Retrieval increases up to a certain threshold. This indicates that increasing the number of generated samples can enhance the diversity of knowledge. However, because only a limited number of keywords can be extracted from the original dataset, the diversity of knowledge plateaus once this threshold is reached, and model performance does not improve further beyond this point.

Effect of the Model Size
We conduct an investigation into the impact of model size on performance using two datasets, COPA and CoLA.
Our experimental results, shown in Figure 3, reveal that when employing a smaller model, RetriKT-Retrieval consistently outperforms RetriKT-KD, showcasing the superiority of retrieval-based knowledge transfer for small-scale models. Additionally, we observe that both RetriKT-Retrieval and RetriKT-KD consistently outperform the small-scale model fine-tuned solely on the original dataset. This observation further emphasizes the effectiveness of extracting knowledge from the LLM and applying it to enhance model performance.

Accuracy and Diversity
In this section, we examine whether the LLM trained through reinforcement learning can generate more accurate and diverse knowledge. To evaluate the diversity of knowledge, we employ the Self-BLEU metric on the generated samples, while the accuracy of knowledge is measured using the cross-entropy between the probability distribution predicted by the reward model and the label generated by the LLM. For both metrics, smaller values indicate better performance.
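For reference, the two metrics can be computed roughly as below: Self-BLEU averages each sample's BLEU-n score against the remaining samples, and accuracy is the average cross-entropy between the reward model's distribution and the label the LLM generated for each sample. This is an illustrative sketch using NLTK's BLEU implementation, not the evaluation script used in the paper.

```python
import math
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(token_lists, n=3):
    """Average BLEU-n of each sample against all other samples (lower = more diverse)."""
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / n for _ in range(n))
    scores = []
    for i, hyp in enumerate(token_lists):
        refs = token_lists[:i] + token_lists[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

def label_cross_entropy(reward_probs, generated_labels):
    """Average cross-entropy between the reward model's class distribution and
    the label generated by the LLM (lower = more accurate)."""
    return -sum(math.log(p[y]) for p, y in zip(reward_probs, generated_labels)) \
           / len(generated_labels)

samples = ["a b c d".split(), "a b e f".split(), "g h i j".split()]
print(self_bleu(samples))
print(label_cross_entropy([[0.8, 0.2], [0.3, 0.7]], [0, 1]))
```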
As shown in Table 4, the results demonstrate that training the LLM through reinforcement learning leads to the generation of more accurate and diverse knowledge across all datasets. This outcome highlights the effectiveness of our reward function design. Additionally, we observe that larger datasets tend to yield samples with smaller Self-BLEU scores. We conjecture that this is because a larger dataset allows a wider range of diverse keywords to be extracted, thereby enabling the generation of more varied knowledge.

Conclusion
Our study tackles the task of compressing LLMs and introduces a pioneering compression paradigm called Retrieval-based Knowledge Transfer. This approach efficiently transfers the knowledge of LLMs to small-scale models by creating a knowledge store, enabling the small model to retrieve pertinent information during inference. Through extensive experiments conducted on commonly used benchmarks, we demonstrate that our framework substantially improves the performance of small-scale models by leveraging the knowledge contained within LLMs. In future research, we plan to investigate the application and performance of our proposed method on even larger language models, such as T5-11B.

Limitations
In this section, we discuss the limitations of our work. First, due to limited computational resources, we did not attempt experiments with larger models such as T5-11B. Moreover, resource constraints restricted our grid search for hyperparameters to a limited range, which potentially leaves room for improving the metrics reported in this paper. Second, our proposed method requires the establishment of a knowledge store. Compared to other model compression methods, our approach therefore introduces additional storage space and incurs slightly more time for retrieval.

A Training Details for PPO
In this section, we present the training details of PPO in Table 5. The meaning of these hyperparameters is detailed in the trl library (von Werra et al., 2020) and NLPO (Ramamurthy et al., 2022).

B Preprocessed Examples
In this section, we provide examples of our preprocessing for each of the data sets we consider.

C Faithfulness of the Generated Knowledge
To further validate the faithfulness of the generated knowledge, we gathered five volunteers to assess the faithfulness of the generated texts for the WiC, COPA, and CoLA datasets. Each text was given a rating from 1 to 5. We randomly sampled 100 texts from each dataset and calculated the average rating as the final faithfulness score. The results of the faithfulness evaluation are shown in Table 6, showing that the faithfulness of the generated texts for each dataset is quite high.

Figure 1 :
Figure 1: The framework of RetriKT (Retrieval-based Knowledge Transfer), which consists of three steps: (1) fine-tune additional soft prompts for the LLM by supervised learning; (2) further fine-tune the soft prompts by reinforcement learning to enhance the generation quality; (3) create the knowledge store based on the knowledge extracted from the LLM, and teach the small-scale model how to retrieve relevant knowledge from it.

Figure 2 :
Figure 2: Effect of the sample times m. Results are the model performance on the COPA dataset using BERT 2 and BERT 4. The Finetune method is trained on the original dataset, so it is independent of m.

Figure 3 :
Figure 3: Effect of the model size. Results on the COPA and CoLA datasets.

Table 1 :
The data statistics and metrics of all tasks.

Table 2 :
Main experimental results (%) on six low-resource tasks. We also report the number of parameters of each PLM (without embeddings). We re-implement all baseline models based on the released code provided by AD-KD (Wu et al., 2023) and incorporate BERT base as the TA model (w. TA). There are two versions of RetriKT: one is trained by Vanilla KD on the generated samples (RetriKT-KD), and the other is trained with the retrieval learning objective (RetriKT-Retrieval). The best results for each backbone are shown in bold. Results are statistically significant with respect to all baselines on each student model (all p-values < 0.005).

Table 3 :
Ablation studies of different components utilizing BERT 2 and BERT 4 as the small-scale models on datasets CB, COPA and CoLA.

Table 4 :
Diversity and accuracy analysis for the generated knowledge. Self-BLEU reflects the diversity of knowledge, while cross-entropy reflects the accuracy of knowledge. For both metrics, smaller values indicate better performance.

Table 5 :
Hyperparameters for PPO.

Processed input example (from Appendix B): hypothesis: Valence was helping premise: Valence the void-brain, Valence the virtuous valet. Why couldn't the figger choose his own portion of titanic anatomy to shaft? Did he think he was helping?

Table 6 :
The faithfulness score of the generated texts for datasets WiC, COPA and CoLA.