LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23, showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.


Introduction
The widespread adoption of ChatGPT has transformed numerous scenarios by harnessing the powerful generalization and reasoning capabilities of large language models (LLMs). In practical applications, crafting suitable prompts is crucial and usually involves techniques such as chain-of-thought prompting, in-context learning, and retrieving related documents or historical conversations (Wei et al., 2022; Chase, 2022). While these methods can elicit highly effective generations by activating LLMs' domain-specific knowledge, they often require longer prompts. Therefore, striking a balance between the massive computational demands of LLMs and the need for longer prompts has become an urgent issue. Some studies attempt to accelerate model inference by modifying the parameters of LLMs through quantization (Dettmers et al., 2022; Xiao et al., 2023), compression (Frantar and Alistarh, 2023), etc. However, these approaches may not be suitable when the LLMs can be accessed only via APIs.
Approaches that attempt to reduce the length of the original prompts while preserving essential information have emerged lately. These approaches are grounded in the observation that natural language is inherently redundant (Shannon, 1951) and thus can be compressed. Gilbert et al. (2023) also indicate that LLMs can effectively reconstruct source code from compressed text descriptions while maintaining a high level of functional accuracy. Therefore, we follow this line of research and compress a long prompt into a shorter one without any gradient flowing through the LLMs, so as to support applications built on a wider range of LLMs.
In terms of information entropy, tokens with lower perplexity (PPL) contribute less to the overall entropy gain of the language model. In other words, removing tokens with lower perplexity has a relatively minor impact on the LLM's comprehension of the context. Motivated by this, Li (2023) proposes Selective-Context, which first employs a small language model to compute the self-information of each lexical unit (such as sentences, phrases, or tokens) in the original prompt, and then drops the less informative content for prompt compression. However, this method not only ignores the interdependence between the compressed contents but also neglects the correspondence between the target LLM and the small language model used for prompt compression.
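As a rough sketch of this filtering idea (our own toy code, not Selective-Context's implementation: the log-probabilities below stand in for a small LM's per-token output, and `filter_by_self_information` is a name we introduce for illustration), dropping the lowest self-information tokens looks like:

```python
def filter_by_self_information(tokens, logprobs, keep_rate):
    """Drop the tokens whose self-information (-log p) is lowest,
    keeping roughly `keep_rate` of the original tokens.
    logprobs[i] is log p(token_i | context) from a small LM."""
    self_info = [-lp for lp in logprobs]  # higher = more informative
    k = max(1, round(len(tokens) * keep_rate))
    # indices of the k most informative tokens, restored to original order
    top = sorted(range(len(tokens)), key=lambda i: self_info[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = ["The", "answer", "is", "42", "."]
logprobs = [-0.1, -2.5, -0.2, -4.0, -0.05]  # toy values from a small LM
print(filter_by_self_information(tokens, logprobs, 0.4))  # -> ['answer', '42']
```

Unsurprising filler words ("The", "is", ".") have high probability, hence low self-information, and are dropped first; the content-bearing tokens survive.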
This paper proposes LLMLingua, a coarse-to-fine prompt compression method, to address the aforementioned issues. Specifically, we first present a budget controller that dynamically allocates different compression ratios to the various components of the original prompt, such as the instruction, demonstrations, and the question, and meanwhile performs coarse-grained, demonstration-level compression to maintain semantic integrity under high compression ratios. We further introduce a token-level iterative algorithm for fine-grained prompt compression. Compared with Selective-Context, it better preserves the key information within the prompt by taking into account the conditional dependencies between tokens. Additionally, we address the challenge of distribution discrepancy between the target LLM and the small language model used for prompt compression, and propose an instruction tuning based method to align the distributions of the two language models.
We validate the effectiveness of our approach on four datasets from different domains, i.e., GSM8K and BBH for reasoning and ICL, ShareGPT for conversation, and Arxiv-March23 for summarization. The results show that our method yields state-of-the-art performance across the board. Furthermore, we conduct extensive experiments and discussions to analyze why our approach attains superior performance. To the best of our knowledge, we are the first to evaluate reasoning and ICL capabilities in the domain of efficient LLMs.

Efficient LLMs
Efficient large language models have gained significant attention in the recent research community, especially with the growing prominence of ChatGPT. Most of these methods aim to reduce the costs of inference and fine-tuning by modifying the model parameters through quantization (Dettmers et al., 2022; Frantar et al., 2023; Xiao et al., 2023), compression (Frantar and Alistarh, 2023), instruction tuning (Taori et al., 2023; Chiang et al., 2023; Xu et al., 2023), or delta tuning (Hu et al., 2022).
A line of studies attempts to optimize inference costs from the perspective of the input prompts. Motivated by the observation that there is an abundance of identical text spans between the input and the generated result, Yang et al. (2023) directly copy tokens from prompts during decoding to accelerate the inference process of LLMs. Some approaches focus on compressing prompts, specifically, learning special tokens via prompt tuning of LLMs to reduce the number of tokens to be processed during inference (Mu et al., 2023; Ge et al., 2022; Wingate et al., 2022; Chevalier et al., 2023; Ge et al., 2023). Unfortunately, these methods are usually tailored to particular tasks, and some of them (Mu et al., 2023; Chevalier et al., 2023) even require fine-tuning the whole language model, which severely limits their application scenarios. Furthermore, some studies (Chase, 2022; Zhang et al., 2023) attempt to utilize LLMs to summarize dialogs or data, thereby forming memory and knowledge. However, these approaches require multiple invocations of LLMs, which are quite costly.
Some methods reduce the prompt length by selecting a subset of demonstrations. For example, Zhou et al. (2023) introduce a reinforcement learning based algorithm to allocate a specific number of demonstrations to each question. Other methods focus on token pruning (Goyal et al., 2020; Kim and Cho, 2021; Kim et al., 2022; Rao et al., 2021; Modarressi et al., 2022) and token merging (Bolya et al., 2023). However, these approaches were proposed for smaller models such as BERT and ViT. Moreover, they depend on fine-tuning the models or obtaining intermediate results during inference.
The most similar work to this paper is Selective-Context (Li, 2023), which evaluates the informativeness of lexical units by computing self-information with a small language model and drops the less informative content for prompt compression. This paper is inspired by Selective-Context and further proposes a coarse-to-fine framework to address its limitations.

Out-of-Distribution (OoD) Detection
Recently, a series of studies has been proposed for unsupervised OoD detection. With only in-distribution texts available for learning, these methods either fine-tune a pre-trained language model (Arora et al., 2021) or train a language model from scratch (Mai et al., 2022). Wu et al. (2023) analyze the characteristics of these methods and leverage multi-level knowledge distillation to integrate their strengths while mitigating their limitations. Finally, the perplexity output by the resulting language model is used as an indication of an example being OoD.
This paper also regards perplexity as a measurement of how well a language model predicts a sample.In contrast to out-of-distribution detection, which identifies examples with high perplexities as indicative of unreliable predictions, we consider tokens with higher perplexity to be more influential during the inference process of language models.

LLMs as a Compressor
Recently, many perspectives have interpreted large language models and unsupervised learning as forms of compression.

Problem Formulation
A prompt compression system is designed to generate a compressed prompt $\widetilde{x} = \{\widetilde{x}_i\}_{i=1}^{\widetilde{L}}$ from a given original prompt $x = (x^{ins}, x^{dems}, x^{que})$, where $x^{ins} = \{x^{ins}_i\}_{i=1}^{L_{ins}}$, $x^{dems} = \{x^{dems}_i\}_{i=1}^{L_{dems}}$, and $x^{que} = \{x^{que}_i\}_{i=1}^{L_{que}}$ denote the instruction, demonstrations, and the question in the original prompt $x$, and $L_{ins}$, $L_{dems}$, and $L_{que}$ represent the numbers of tokens in $x^{ins}$, $x^{dems}$, and $x^{que}$, respectively. Let $L = L_{ins} + L_{dems} + L_{que}$ denote the total sequence length of $x$ and $\widetilde{L}$ that of $\widetilde{x}$; the compression rate is defined as $\tau = \widetilde{L}/L$, $\tau \in [0, 1]$, and the compression ratio is $1/\tau$. A smaller value of $\tau$ implies a lower inference cost, which is preferable. Let $\widetilde{x}_G$ represent the LLM-generated results derived from $\widetilde{x}$ and $x_G$ the results derived from $x$; the distribution of $\widetilde{x}_G$ is expected to be as similar as possible to that of $x_G$. This can be formulated as:

$\min_{\widetilde{x}} \; D\big(P(\widetilde{x}_G \mid \widetilde{x}),\, P(x_G \mid x)\big),$

where $D$ is a distance between distributions, e.g., the KL divergence.
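The rate and ratio definitions reduce to one line of arithmetic; a tiny sketch (function name ours, for illustration):

```python
def compression_rate(n_compressed_tokens, n_original_tokens):
    """tau = L~/L, in [0, 1]; the compression ratio is 1/tau,
    and a smaller tau means a cheaper LLM call."""
    return n_compressed_tokens / n_original_tokens

tau = compression_rate(120, 2400)
print(tau, 1 / tau)  # rate 0.05, i.e., a 20x compression ratio
```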

Methodology
In this section, we elaborate on the proposed coarse-to-fine prompt compression approach, LLMLingua. First, we introduce a budget controller that dynamically allocates different compression ratios to the various components of a prompt and, meanwhile, performs coarse-grained, demonstration-level compression to maintain semantic integrity under high compression ratios. Next, we describe the proposed iterative token-level compression algorithm, designed to retain knowledge from the prompt while compressing. Finally, we introduce distribution alignment to address the gap between the small model and black-box large models. Figure 1 shows the framework.

Budget Controller
The budget controller is designed to allocate different budgets, i.e., compression ratios, to the different components of a prompt, such as the instruction, demonstrations, and the question, at the sentence or demonstration level. There are two considerations: (i) In general, the instruction and the question in a prompt have a direct influence on the generated results, as they should contain all the knowledge necessary to generate the answer that follows. On the contrary, if there are multiple demonstrations in the original prompt, the conveyed information may be redundant. Therefore, a tailored budget controller is required to allocate more budget (i.e., smaller compression ratios) to instructions and questions, and less budget to demonstrations.

(ii) When a high compression ratio is required, token-level dropout as in Li (2023) might make the compressed prompts too trivial and thus lose vital information from the original prompt. Consequently, sentence-level dropout should be employed instead to preserve a certain degree of linguistic integrity. Especially in the case of multiple redundant demonstrations, we can even perform demonstration-level control to meet the compression requirement.

Algorithm 1 illustrates the overall procedure of the budget controller.

Algorithm 1 Pseudo code of Budget Controller.
Input: A small language model $\mathcal{M}_s$; the original prompt $x = (x^{ins}, x^{dems}, x^{que})$.
1: Set the selected demonstration set $D = \emptyset$.
2: Get the demonstration compression rate $\tau_{dems}$ by Eq. (2).
3: Calculate the perplexity of each demonstration via $\mathcal{M}_s$.
4: Rank all demonstrations in descending order of their perplexity as a list $(x^{dem}_{(1)}, \dots, x^{dem}_{(N)})$, where $N$ is the number of demonstrations and $x^{dem}_{(j)}$ is the $j$-th demonstration in the ranking.
5: for $j = 1, \dots, N$ do
6:   if $L_D > k \cdot \tau_{dems} \cdot L_{dems}$ then
7:     break
8:   end if
9:   Append $x^{dem}_{(j)}$ to $D$.
10: end for
11: Allocate the remaining budget to $x^{ins}$ and $x^{que}$ via Eq. (3).
Output: The subset of demonstrations $D$ obtained from coarse-grained compression; the additional budget $\Delta\tau_{ins,que}$ for the instruction and the question.
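The coarse-grained stage can be sketched in a few lines of Python (our own rendering of Algorithm 1, with the demonstration perplexities passed in precomputed rather than produced by a real small LM; the budget formula is our reading of Eq. (2)):

```python
def budget_controller(demos, demo_ppl, tau, tau_ins, tau_que,
                      L_ins, L_que, k=2):
    """Select a subset D of demonstrations, highest perplexity first,
    stopping before the token budget k * tau_dems * L_dems is exceeded.
    demos: list of demonstrations, each a list of tokens;
    demo_ppl: perplexity of each demonstration under the small LM."""
    L_dems = sum(len(d) for d in demos)
    L = L_ins + L_dems + L_que
    # budget left for demonstrations after reserving tau_ins / tau_que
    tau_dems = (tau * L - tau_ins * L_ins - tau_que * L_que) / L_dems
    ranked = sorted(range(len(demos)), key=lambda i: demo_ppl[i], reverse=True)
    D, L_D = [], 0
    for i in ranked:
        if L_D + len(demos[i]) > k * tau_dems * L_dems:
            break
        D.append(demos[i])
        L_D += len(demos[i])
    return D

demos = [["a"] * 100, ["b"] * 100, ["c"] * 100]  # three 100-token demos
D = budget_controller(demos, demo_ppl=[3.0, 1.0, 2.0], tau=0.3,
                      tau_ins=0.85, tau_que=0.9, L_ins=20, L_que=10)
print(len(D))  # 1: only the highest-perplexity demonstration fits
```

With an overall rate of 0.3, the demonstration budget works out to roughly 146 tokens here, so only the most informative (highest-PPL) demonstration survives the coarse-grained pass.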

Derive compression ratio for demonstrations.
We first compute the compression rate for demonstrations, $\tau_{dems}$, according to the target overall compression rate $\tau$ and the pre-defined compression rates for the instruction and the question, $\tau_{ins}$ and $\tau_{que}$:

$\tau_{dems} = (\tau \cdot L - \tau_{ins} \cdot L_{ins} - \tau_{que} \cdot L_{que}) \,/\, L_{dems}.$ (2)
Demonstration-level prompt compression.
With the derived $\tau_{dems}$ for demonstrations, we then perform coarse-grained, demonstration-level prompt compression: we construct $D$, a subset of demonstrations from $x^{dems}$. Specifically, we first employ a small language model $\mathcal{M}_s$, such as GPT-2 or LLaMA, to compute the perplexity of each demonstration in $x^{dems}$. Then, we select demonstrations in descending order of their perplexity values, until adding one more demonstration to $D$ would make the total number of tokens in $D$ exceed the maximum $k \cdot \tau_{dems} \cdot L_{dems}$, where $k$ is the granular control coefficient.

Adjust compression ratios for instruction and question.
After obtaining the coarse-grained subset $D$, we allocate the remaining budget to the instruction and the question:

$\Delta\tau_{ins,que} = (\tau_{dems} \cdot L_{dems} - L_D) \,/\, (L_{ins} + L_{que}),$ (3)

where $L_D$ denotes the total number of tokens in $D$.

Algorithm 2 Pseudo code of Iterative Token-level Prompt Compression (ITPC).
Input: A small language model $\mathcal{M}_s$; the prompt from the budget controller $x' = (x^{ins}, x^{D}, x^{que})$; the target compression rate $\tau$; the adjusted compression rate $\Delta\tau_{ins,que}$.
1: Set the selected token set $T = \emptyset$.
2: Get the segment set $S = \{s_1, \dots, s_m\}$.
3: for each segment $s_i \in S$ do
4:   Compute the conditional perplexity of every token in $s_i$ via $\mathcal{M}_s$, conditioned on the previously compressed segments.
5:   Derive the compression threshold $\gamma_i$ from the perplexity distribution and the segment's compression rate $\tau_{s_i}$.
6:   Append the tokens in $s_i$ whose perplexity exceeds $\gamma_i$ to $T$.
7: end for
Output: The compressed prompt $\widetilde{x}$ formed by the tokens in $T$.

Iterative Token-level Prompt Compression
Utilizing perplexity for prompt compression encounters an intrinsic limitation, namely the independence assumption, similar to the shortcomings of the masked language model (Yang et al., 2019): each token's perplexity is computed only once over the full prompt, implicitly assuming

$p(\widetilde{x}_i \mid \widetilde{x}_{<i}) \approx p(x'_i \mid x'_{<i}), \quad i = 1, \dots, \widetilde{L},$

where $x' = (x^{ins}, x^{D}, x^{que})$ is the original prompt after demonstration-level compression; $x^{D}$ is the concatenation of all demonstrations in $D$; $\widetilde{x}$ is the final compressed prompt; $\widetilde{x}_{<i}$ and $x'_{<i}$ denote the preserved and the compressed tokens before the $i$-th token, respectively; and $L'$ and $\widetilde{L}$ denote the numbers of tokens in $x'$ and $\widetilde{x}$, respectively.

Here we propose an iterative token-level prompt compression (ITPC) algorithm to mitigate the inaccuracy introduced by this conditional independence assumption. Algorithm 2 shows the pseudocode.

Specifically, we first divide the target prompt $x'$ into several segments $S = \{s_1, s_2, \dots, s_m\}$. Then, we use the small model $\mathcal{M}_s$ to obtain the perplexity distribution of all segments. The compressed prompt obtained from each segment is concatenated to the subsequent segment, enabling a more accurate estimation of the conditional probability. The corresponding probability estimation function can be formulated as

$p(x_{i,j} \mid \widetilde{s}_1, \dots, \widetilde{s}_{i-1}, x_{i,<j}),$

where $x_{i,j}$ denotes the $j$-th token in the $i$-th segment, $\widetilde{s}_k$ denotes the compressed result of segment $s_k$, and $L_{s_i}$ and $\widetilde{L}_{s_i}$ represent the token lengths of the $i$-th original and compressed segment, respectively.

Once the conditional perplexities for each segment $s_i$ are obtained, the compression threshold $\gamma_i$ for $s_i$ is dynamically calculated based on the PPL distribution and the corresponding compression rate $\tau_{s_i}$ of the segment. Finally, the tokens in each $s_i$ whose PPL is greater than $\gamma_i$ are retained in the compressed prompt.
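The iterative loop can be sketched as follows (a simplification in our own code: `small_lm_ppl` is a stand-in that returns per-token perplexities conditioned on the compressed prefix, and a single keep-fraction `tau_seg` replaces the per-segment dynamic rates):

```python
def itpc(segments, small_lm_ppl, tau_seg):
    """Compress segment by segment; each segment's perplexities are
    conditioned on the already-compressed prefix, so later segments see
    the output of earlier passes rather than the original text."""
    compressed = []
    for seg in segments:
        ppl = small_lm_ppl(compressed, seg)
        k = max(1, round(len(seg) * tau_seg))
        # dynamic threshold gamma: keep (roughly) the top-k PPL tokens;
        # ties at the threshold are all retained
        gamma = sorted(ppl, reverse=True)[k - 1]
        compressed.extend(t for t, p in zip(seg, ppl) if p >= gamma)
    return compressed

# toy stand-in: later tokens in a segment get higher perplexity
fake_ppl = lambda prefix, seg: list(range(len(seg)))
print(itpc([["a", "b", "c", "d"]], fake_ppl, 0.5))  # -> ['c', 'd']
```

The key design choice mirrored here is that the prefix passed to the scorer is the compressed text so far, which is what distinguishes ITPC from a single independent scoring pass over the whole prompt.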

Distribution Alignment
To narrow the gap between the distribution of the LLM and that of the small language model used for prompt compression, we align the two distributions via instruction tuning. Specifically, we start from a pre-trained small language model $\mathcal{M}_s$ and use data generated by the LLM to perform instruction tuning on $\mathcal{M}_s$. The optimization of $\mathcal{M}_s$ can be formulated as:

$\min_{\theta_{\mathcal{M}_s}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(y_i^{LLM},\, \mathcal{M}_s(x_i;\, \theta_{\mathcal{M}_s})\big),$

where $\theta_{\mathcal{M}_s}$ denotes the parameters of $\mathcal{M}_s$, $(x_i, y_i^{LLM})$ denotes a pair of an instruction $x_i$ and the corresponding LLM-generated text $y_i^{LLM}$, $\mathcal{L}$ is the language-modeling loss, and $N$ is the number of examples used for instruction tuning.
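In code, the alignment objective amounts to ordinary causal-LM fine-tuning on (instruction, LLM-output) pairs; the sketch below spells out the per-token negative log-likelihood with a toy probability function standing in for the small LM (names are ours, for illustration):

```python
import math

def alignment_loss(small_lm, pairs):
    """Average negative log-likelihood of the LLM-generated targets
    under the small LM. small_lm(x, tok) returns the probability the
    small LM assigns to `tok` given instruction x -- a stand-in for a
    real model's next-token distribution."""
    total, count = 0.0, 0
    for x, y_llm in pairs:
        for tok in y_llm:
            total += -math.log(small_lm(x, tok))
            count += 1
    return total / count
```

Minimizing this over the small LM's parameters, e.g., with a standard Transformers fine-tuning loop, is what the paper calls distribution alignment.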

Settings
Datasets To comprehensively assess the effectiveness of compressed prompts in retaining LLM abilities, we evaluate their performance across four datasets. For reasoning and in-context learning (ICL), we use GSM8K (Cobbe et al., 2021) and BBH (Suzgun et al., 2022). For contextual understanding, we use ShareGPT (sha, 2023) for conversation and Arxiv-March23 (Li, 2023) for summarization. It is worth noting that neither the small LM nor the target LLMs used in this paper have seen any of the evaluation datasets, especially the last two, which were newly collected this year. We follow the experimental setup of previous work (Fu et al., 2023a; Li, 2023) for the usage of these datasets. Please refer to Appendix A.1 for detailed information.
Implementation Details In this paper, we employ GPT-3.5-Turbo-0301 and Claude-v1.3 as the target LLMs, which can be accessed via the OpenAI and Claude APIs. To improve the stability of the outputs produced by LLMs, we apply greedy decoding with a temperature of 0 across all experiments. The Alpaca dataset (Taori et al., 2023) is exclusively employed for aligning small language models with black-box LLMs and is not used in the evaluation process. In our experiments, we utilize either Alpaca-7B or GPT2-Alpaca as the small pre-trained language model $\mathcal{M}_s$ for compression. We implement our approach based on PyTorch 1.12.0 and Huggingface's Transformers. We set the granular control coefficient $k$ to 2 and use the pre-defined compression rates $\tau_{ins} = 0.85$ and $\tau_{que} = 0.9$ for instructions and questions. The segment size used in the iterative token-level compression is set to 100.

Baselines We consider the following baselines:
• GPT4-Generation: Instruct GPT-4 to compress the original prompt. We used ten sets of instructions here and report the best results. Appendix C displays the instructions we employed.
• Random Selection: Randomly select demonstrations or sentences from the original prompt.
• Selective-Context (Li, 2023): Use the phrase-level self-information from a small language model to filter out less informative content. We use the same small LM, i.e., Alpaca-7B, for a fair comparison.

Main Results
Tables 1 and 2 report the results of our approach alongside those of the baseline methods on GSM8K, BBH, ShareGPT, and Arxiv-March23. It can be seen that our proposed method consistently outperforms the prior methods by a large margin in almost all experiments. Specifically, on GSM8K and BBH, the reasoning and in-context learning benchmarks, our method even achieves slightly higher results than the full-shot approach, while also delivering impressive compression ratios ($1/\tau$) of 5x and 3x, respectively, under the 1-shot constraint. This demonstrates that our compressed prompts effectively retain the reasoning information contained in the original prompt. As the compression ratio increases, i.e., under the half-shot and quarter-shot constraints, the performance experiences a slight decline. For instance, on GSM8K, the EM scores decrease by 1.44 and 1.52, respectively, despite compression ratios as high as 14x and 20x. On BBH, our approach achieves compression ratios of 5x and 7x with the EM score decreasing by 8.5 and 13.2 points, respectively. This performance is already quite satisfactory, as it approaches the score of 62.0 achieved by PaLM-540B under the half-shot constraint. Our case study reveals that the declined performance on BBH is mainly due to challenging reasoning tasks, such as tracking_shuffled_objects_seven_objects.
Moreover, on ShareGPT and Arxiv-March23, two contextual understanding benchmarks, our approach achieves acceleration ratios of 9x and 3.3x with a high BERTScore F1, indicating that it successfully retains the semantic information of the initial prompts.

Analysis on Reasoning & ICL Tasks.
Here we analyze the performance of our approach and baseline methods on the difficult reasoning and in-context learning (ICL) benchmarks GSM8K and BBH.
We notice that our approach shows significant performance improvements over the strong baseline Selective-Context under all settings. We conjecture that, by relying on phrase-level self-information, Selective-Context is prone to losing critical reasoning information within the chain-of-thought process. Especially on GSM8K, its performance is lower than ours by 33.10 points at a compression ratio of 20x. The inferior performance of Sentence Selection suggests that it may face similar issues of fragmentary reasoning logic. Surprisingly, although GPT-4 has demonstrated strong text generation capability, its suboptimal performance on prompt compression indicates that the generated prompts may omit crucial details from the original prompt, particularly reasoning steps.
In addition to the findings mentioned above, the experiments also demonstrate that our method can preserve the ICL capacity of prompts for LLMs. Compared to the zero-shot results, our approach exhibits significant performance improvements of 51.55 and 24.53 points even at the largest compression ratios. Notably, on GSM8K, our 20x compressed prompt outperforms the 8-shot 3-step CoT by 2.43, further suggesting that our method effectively retains the reasoning information.

Ablation
To validate the contributions of the different components of our approach, we introduce six variants of our model for the ablation study: i) Ours w/o Iterative Token-level Compression, which performs token-level compression in a single inference pass rather than iteratively. ii) Ours w/o Budget Controller, which directly employs ITPC with the same compression ratio for all components. iii) Ours w/o Dynamic Compression Ratio, which uses the same compression ratio for all components. iv) Ours w/ Random Selection in Budget Controller, which randomly selects demonstrations or sentences for demonstration-level prompt compression. v) Ours w/o Distribution Alignment, which removes the distribution alignment module and directly uses the pre-trained LLaMA-7B as the small language model. vi) Ours w/ Remove Stop Words, which removes the stop words in the original prompts using NLTK. Table 3 shows the results.
Comparing Ours with w/o Iterative Token-level Prompt Compression, we observe a significant decline in Exact Match when the conditional dependence between compressed tokens is not considered. We conjecture that this variant may lose essential information in the prompt, especially low-frequency keywords that appear frequently in the given prompt. Comparing Ours with w/o Dynamic Compression Ratio and w/o Budget Controller reveals that different components of the prompt exhibit varying sensitivity: instructions and questions necessitate a lower compression ratio. To balance compression ratio against linguistic integrity, introducing a demonstration- or sentence-level filter better preserves sufficient linguistic information, even at higher compression ratios. Ours w/ Random Selection in Budget Controller indicates that selecting sentences or demonstrations based on perplexity better identifies information-rich sentences for target LLMs. Distribution Alignment allows small LMs to generate distributions that more closely resemble those of target LLMs, resulting in a further improvement of 0.56 on GSM8K.

Discussion
Different Target LLMs Here we test our method with Claude-v1.3 as the target LLM to demonstrate its generalizability across black-box LLMs beyond the GPT series. Due to API cost constraints, we only consider the scenarios with the one-shot and half-shot constraints. Similarly, we employ Alpaca-7B as the small language model, owing to the challenges of collecting alignment data. As shown in Table 4, our method achieves improvements over the simple prompt of 0.8 and 1.7 EM points with compression ratios of 5x and 14x, respectively.
Different Small LMs We further test our approach with different small language models: we fine-tune GPT2-small on the Alpaca dataset and use it as the small LM in our system. As shown in Table 5, the results obtained with the Alpaca-finetuned GPT2-small are weaker than those obtained with Alpaca-7B, with performance drops of 2.06, 0.99, and 1.06 EM points at different compression ratios. This is due to the significant distribution discrepancy between the small LM and the target LLM.
Even with distribution alignment, it is still difficult to directly estimate the target LLM's distribution using that of the small language model. Similar observations have been reported in Li (2023). However, benefiting from the proposed budget controller and the iterative token-level prompt compression algorithm, our approach achieves satisfactory results on difficult tasks such as reasoning, even with the less powerful GPT2-small as the small language model.

The Generation Results of Compressed Prompt
Appendix E displays several compressed prompts along with the texts subsequently generated from them. It is evident that the compressed prompts can still guide the generation of multi-step reasoning outcomes similar to the original ones. In contrast, prompts compressed using Selective-Context exhibit errors in their reasoning logic. This highlights the effectiveness of our method in preserving crucial semantic information while retaining reasoning capabilities.
As depicted in Figure 2, we also analyze the relationship between the compression ratio and the length of the corresponding generated texts. It can be observed that as the compression ratio increases, the text length produced by target LLMs tends to decrease, albeit to varying degrees across different datasets. This indicates that prompt compression not only saves computational resources on the input side but also contributes to computational savings in the generation stage.

Table 6: The overhead of LLMLingua: the number of tokens involved in computation and the end-to-end latency.

Overhead of LLMLingua
The overall computation of our system is the sum of the prompt compression and the subsequent inference. This can be roughly formulated as

$C_{total} = C_{small} \cdot (L + L' + \widetilde{L}) + C_{LLM} \cdot \widetilde{L},$

where $C_{small}$ and $C_{LLM}$ represent the per-token computation load of the small LM and the LLM, respectively, and $L$, $L'$, and $\widetilde{L}$ are the numbers of token inferences for the budget controller, the perplexity calculation of the tokens to compress in ITPC, and the conditioned perplexity calculation of the compressed results in ITPC (using the KV cache), respectively. Assuming that the small LM enjoys the same system optimizations as the LLMs, such as the use of FasterTransformer and quantization techniques, we can estimate the ratio between $C_{small}$ and $C_{LLM}$ from the model sizes: $C_{small} \approx (7/175) \cdot C_{LLM} = C_{LLM}/25$. With a compression ratio of $1/\tau = 5$ and $L' \le L$, we have

$C_{total} \le \tfrac{1}{25}\,(L + L + 0.2L) \cdot C_{LLM} + 0.2L \cdot C_{LLM} \approx 0.29 \cdot L \cdot C_{LLM}.$

That is, we can achieve nearly 4x savings in computational resources when using the smaller LM with a prompt compression ratio of 5x. Table 6 shows the end-to-end latency of different systems on a V100-32G GPU with compression ratios from 1x to 10x. We can see that LLMLingua incurs a relatively small computational overhead and can achieve a speedup ranging from 1.7x to 5.7x.
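The arithmetic can be checked with a short script (the per-stage token counts follow our reading of the partly garbled cost discussion above, treating each ITPC scoring pass as covering at most the full prompt, so this is an upper bound rather than the paper's exact equation):

```python
def relative_cost(L, ratio, c_small_over_llm=7 / 175):
    """Total cost of compress-then-infer, relative to running the target
    LLM directly on all L prompt tokens. Per-stage token counts are an
    upper-bound assumption: the small LM scores the full prompt for the
    budget controller and again for ITPC, plus the compressed tokens."""
    L_c = L / ratio                           # compressed prompt length
    small = c_small_over_llm * (L + L + L_c)  # budget controller + ITPC passes
    llm = L_c                                 # LLM inference on compressed prompt
    return (small + llm) / L

print(relative_cost(1000, 5))  # ~0.29, i.e., nearly 4x cheaper overall
```

At a 1x "compression" ratio the small-LM passes are pure overhead (relative cost above 1), which is why the savings only materialize once the ratio is meaningfully larger than 1.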
Recovering the Compressed Prompt using LLMs Appendix D shows some examples restored from the compressed prompts using GPT-4. It is evident that LLMs can effectively comprehend the semantic information in the compressed prompts, even when it might be challenging for humans. Additionally, we notice that how much information GPT-4 can recover depends on the compression ratio and the small language model we use. For instance, in Figure 4, the prompt compressed using Alpaca-7B is restored to its complete 9-step reasoning process, while in Figure 5, the prompt compressed with GPT2-Alpaca can only be restored to a 7-step reasoning process, with some calculation errors.
Compare with Generation-based Methods We do not develop our approach based on LLM generation primarily for three reasons: i) The content and length of the generated text are uncontrollable. Uncontrollable length requires more iterations to satisfy the compression-ratio constraint. Uncontrollable content leads to low overlap between the generated text and the original prompt, particularly for complex prompts with multi-step inference, which may lose significant amounts of the reasoning path or even produce completely unrelated demonstrations. ii) The computational cost is high. Small language models struggle to handle such complex tasks, and using models like GPT-4 for compression would further increase the computational overhead. Moreover, even powerful generation models like GPT-4 struggle to retain the effective information from prompts, as shown in Table 2.
iii) The compressed prompts obtained from generation models are complete and continuous sentences, usually resulting in a lower compression ratio compared to our coarse-to-fine method.

Compare with Prompt Engineering methods
Our method is orthogonal to prompt engineering methods such as prompt retrieval and prompt ordering. Our work focuses on compressing well-designed prompts, and it performs well on complex, finely tuned prompts such as those of GSM8K. Moreover, the perplexity-based demonstration filtering used in our budget controller can also be applied in scenarios such as prompt retrieval. This demonstrates the compatibility and adaptability of our approach across various LLM settings.

Conclusion
We introduce a coarse-to-fine algorithm for prompt compression, named LLMLingua, which relies on a small LM's PPL to compress prompts for black-box LLMs. Our approach consists of three modules: the Budget Controller, Iterative Token-level Compression, and Distribution Alignment. We validate the effectiveness of our approach on four datasets from different domains, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23, demonstrating that our method achieves state-of-the-art performance across all datasets, with up to 20x compression at only a 1.5-point performance drop. Moreover, we observe that LLMs can effectively restore compressed prompts, and that prompt compression contributes to a reduction in the length of the generated text. Our approach holds substantial practical implications, as it not only reduces computational costs but also offers a potential solution for accommodating longer contexts in LLMs. Compressing prompts has the potential to enhance downstream task performance by compressing longer prompts and to improve LLMs' inference efficiency by compressing the KV cache.

Limitations
There are also some limitations to our approach. For instance, we observe a notable performance drop when trying to achieve excessively high compression ratios, such as 25x-30x on GSM8K, as shown in Figure 3. As the compression ratio increases, especially around 25x-30x, all methods, ours included, experience a substantial performance drop. Compared with other methods, however, the onset of this drop in our approach is shifted to much higher compression ratios. We attribute this to the Budget Controller and the Iterative Token-level Prompt Compression algorithm, which enable our method to retain the original prompt information even at some extreme compression ratios. The upper limit of the compression ratio varies across prompts, depending on factors such as the prompt length, the task type, and the number of sentences involved.
Additionally, there may be subtle differences between the tokenizers used by the small language model and the black-box LLM, which may result in an underestimation of the prompt's token length.

A Experiment Details
A.1 Dataset Details
GSM8K A widely used math reasoning dataset comprising 8,000 problems, including a 1,300-problem test set, that assesses models' capabilities in arithmetic reasoning and in formulating mathematical steps using language (Cobbe et al., 2021). For this dataset, we employ the complex multi-step CoT prompt (Fu et al., 2023b) as the original prompt.
BBH A suite of language and symbolic reasoning tasks, consisting of 6,500 problems across 23 subsets, specifically designed to evaluate chain-of-thought prompting. In our experiment, we adopt the 3-shot CoT prompt as the original prompt, following the approach described by Suzgun et al. (2022).
ShareGPT A conversation dataset from the ShareGPT.com platform (sha, 2023), which includes users' shared conversations with ChatGPT in different languages and various scenarios (e.g., coding, chit-chat, writing assistance, etc.). We use the 575-sample dataset provided by Li (2023) as our test set. We use all dialogue rounds except the final one as the prompt and generate results with GPT-3.5-Turbo as the reference.
Arxiv-March23 A dataset consisting of latest academic papers created in March 2023 from the arXiv preprint repository.We use 500 data items collected by Li (2023) as the test set.Due to the excessive length of some articles, we take the first five sections of each article and truncate each section to 10,000 characters.Then, we concatenate these sections to form the original prompt and use GPT-3.5-Turbo to generate the summary as the reference.
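The Arxiv-March23 preprocessing above (first five sections, each truncated to 10,000 characters, then concatenated) can be sketched as follows; how an article is split into sections is an assumption for illustration, since the paper does not specify it.

```python
def build_prompt(sections, max_sections=5, max_chars=10_000):
    # Keep the first `max_sections` sections, truncate each to
    # `max_chars` characters, and concatenate them into one prompt.
    return "\n\n".join(s[:max_chars] for s in sections[:max_sections])

article = ["Intro " * 3000, "Method", "Results", "Discussion", "Refs", "Appendix"]
prompt = build_prompt(article)  # "Appendix" (the sixth section) is dropped
```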

A.2 Other Implementation Details
All experiments were conducted using a Tesla V100 (32GB). We trained the GPT2-Alpaca model on the Alpaca dataset for eight epochs using a learning rate of 1e-4 and the AdamW optimizer (Loshchilov and Hutter, 2019). The training process took approximately 150 minutes to complete. We use tiktoken and the GPT-3.5-Turbo model to count all tokens.

Table 7: The inference costs ($) for various datasets using GPT-3.5-Turbo.

B Economic Cost
Table 7 displays the estimated inference costs for various datasets, according to the pricing of GPT-3.5-Turbo. Our approach delivers significant savings in computational resources and monetary expenditure, with cost reductions of $4.7, $8.0, $0.4, and $0.8 on the GSM8K, BBH, ShareGPT, and Arxiv-March23 datasets, respectively.
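The savings scale linearly with the number of prompt tokens removed. A minimal sketch of the calculation behind such estimates is below; the $0.002-per-1K-token rate and the token counts are illustrative assumptions, not figures from Table 7.

```python
PRICE_PER_1K_TOKENS = 0.002  # assumed GPT-3.5-Turbo rate ($); check current pricing

def api_cost(total_tokens: float) -> float:
    # Cost of sending `total_tokens` tokens through the API.
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

def savings(original_tokens: float, compression_ratio: float) -> float:
    # Cost saved by sending a prompt compressed by `compression_ratio`
    # instead of the original prompt.
    compressed_tokens = original_tokens / compression_ratio
    return api_cost(original_tokens) - api_cost(compressed_tokens)

# e.g., 2,400 prompt tokens per query, 1,000 queries, 20x compression:
print(round(savings(2_400 * 1_000, 20), 2))
```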

C Instructions used in GPT-4 Generation
The instructions we used in GPT-4 generation are shown below:
1. Could you please rephrase the paragraph to make it short, and keep 5% tokens?
2. Condense the passage to retain only 5% of its original tokens, while preserving its meaning.
3. Short the sentences to 200 tokens.
4. Trim the text down to 200 tokens in total.
5. Please provide a concise summary of the given examples in several sentences, ensuring that all reasoning information is included.
6. Summarize the provided examples in a few sentences, maintaining all essential reasoning aspects.
7. Remove redundancy and express the text concisely in English, ensuring that all key information and reasoning processes are preserved.
8. Eliminate repetitive elements and present the text concisely, ensuring that key details and logical processes are retained.
9. Follow these steps to shorten the given text content: 1. First, calculate the amount of information contained in each sentence, and remove sentences with less information. 2. Next, further condense the text by removing stop words, unnecessary punctuation, and redundant expressions. Refine the content while ensuring that all key information is retained. Let's do it step by step.
10. To shorten the given text, follow these steps: a) Determine the information value of each sentence and remove those with lower value. b) Further reduce the text by removing stop words, unneeded punctuation, and superfluous expressions, while making sure to keep all vital information intact. Let's do it step by step.
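The two-step procedure described in instructions 9-10 can be sketched programmatically. The information measure below (count of unique non-stop words) and the tiny stop-word list are simplifying assumptions standing in for what GPT-4 does implicitly.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "are", "it", "that"}

def info_value(sentence: str) -> int:
    # Crude proxy for "amount of information": unique non-stop words.
    words = sentence.lower().replace(".", "").split()
    return len({w for w in words if w not in STOP_WORDS})

def shorten(text: str, keep_sentences: int) -> str:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Step (a): keep the most informative sentences, in original order.
    kept = sorted(sentences, key=info_value, reverse=True)[:keep_sentences]
    kept = [s for s in sentences if s in kept]
    # Step (b): remove stop words from what remains.
    out = []
    for s in kept:
        out.append(" ".join(w for w in s.split()
                            if w.lower().strip(".") not in STOP_WORDS))
    return " ".join(out)

text = ("The cat sat on the mat. It is a cat. "
        "Gradient descent updates parameters along the negative gradient.")
print(shorten(text, keep_sentences=1))
```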

D Recovering Compressed Prompts with Large Language Model
In this section, we showcase several examples of employing black-box LLMs to reconstruct compressed prompts. Specifically, we select three compressed prompts with varying compression ratios, produced by distinct small language models on different datasets. These prompts, accompanied by guiding instructions, serve as input to the GPT-4 model.

E Cases Study
We present various cases from multiple datasets, encompassing compressed prompts, results derived from the original prompts, results derived from the compressed prompts, and results obtained using the selective-context approach.
The answer is 115.

Recovered Compressed Prompt (7-step Chain-of-Thought): Question: Sam bought a dozen boxes with 30 lighters each. He arranged them into groups of three and sold them separately. In total, he spent $120 on the lighters. How much profit did he make?
Step 2: Sam took 6 boxes of lighters and sold them for $15.
Step 4: These 330 lighters form groups of three, so he made 110 groups * $2 per group = $220 from them.
The answer is $115.

Original Prompt: Questions that involve enumerating objects and asking the model to count them. Q: I have a blackberry, a clarinet, a nectarine, a plum, a strawberry, a banana, a flute, an orange, and a violin. How many fruits do I have? A: Let's think step by step.

Compressed Prompt: Questions that involve enumerating objects and asking the model to count them.: a blackberry, a a ne a a a a, many have :'s think We first theruits the list and include their in -(-'s the numbers in parentheses:1 + 1 = 6.So the answer is 6.

Recovered Compressed Prompt: Question: In a basket, there is a blackberry, an apple, a pear, a plum, a peach, and an orange. How many fruits are in the basket? Let's think step by step. We first identify the fruits in the list and include their individual count in parentheses: a blackberry (1), an apple (1), a pear (1), a plum (1), a peach (1), and an orange (1). Now, let's add the numbers in parentheses: 1 + 1 + 1 + 1 + 1 + 1 = 6. So the answer is 6. There are 6 fruits in the basket.
Compressed Prompt: Please provide the example of the radiation beam spectrum in the TOPAS format, and I will learn from it.
The beam energy spectrum is composed of 12 energy levels with corresponding weights. dv:So/SARRPbeam/BeamEnergySpectrumValues: Lists the energy values in the spectrum, separated by tabs.
It is that the meanshiftift is to theift of the meanshift to the precision of the numericalations and that this is to a high of cancellation between two dominant components theshiftift.This result is contrary to earlier findings based on inhomogeneous cosmological models exhibiting cosmic backreaction.
For simplicity, the results neglect contributions from optical drift.
Based on a study of the redshift drift in a Lemaitre-Tolman-Bondi model, the optical drift effects are estimated to be at most of order 10% of the redshift drift signal.
The Generation of Full-shot Prompt: The article discusses the computation of redshift drift along light rays in a simulated universe based on the Newtonian N-body simulation code GADGET-2 combined with a perturbed Friedmann-Lemaitre-Robertson-Walker metric in the Newtonian gauge. The study finds that the mean redshift drift is equal to the drift of the mean redshift to the precision of the numerical computations, and this is due to a high degree of cancellation between two dominant components of the redshift drift. The study neglects contributions from optical drift, but it is estimated that these effects are at most of order 10% of the redshift drift signal. The study also finds that the redshift drift contribution from peculiar acceleration of the emitter is negligible in the simulation setup, but it is expected that this contribution will be larger for real observations.

The Generation of Compressed Prompt: The document discusses the computation of redshift drift in a simulated universe using the Nbody simulation code GGET2 with a perturbed Lemaître-Lemaître metric. The results show that the mean shift is equivalent to the shift of the mean to a high degree of cancellation between two dominant components. This contradicts earlier findings based on inhomogeneous cosmological models exhibiting cosmic backreaction. The document also estimates that optical drift effects are at most 10% of the redshift drift signal.
Follow the given examples and answer the question. Demonstration 1: Q: In a certain school, 2/3 of the male students like to play basketball, …. What percent of the population of the school do not like to play basketball if the ratio of the male to female students is 3:2 and there are 1000 students? Let's think step by step. The students are divided into 3 + 2 = 5. Each part represents 1000/5 = 200 students. So, there are 3 x 200 = 600 males. And there are 2 x 200 = 400. … basketball is 520/1000 * 100 = 52. The answer is 52. Demonstration 2: … Demonstration 8: Q: Sam bought a dozen boxes, each with 30 highlighter pens inside, … The answer is 115. Question: Janet's ducks lay 16 eggs per day….. How much in dollars does she make every day at the farmers' market?

Figure 3: The performance of various prompt compression methods at different compression ratios (1/τ) on GSM8K. The dashed line corresponds to the Exact Match score obtained from the full-shot prompt.
The Generation of Selective-Context: The document discusses the use of redshift drift measurements, which were first proposed in 1962. The study involves analyzing non-radial light rays observed by three different observers with non-radial lines of sight. The present-day density profile of the LTB model is shown in a figure, and quadri-linear interpolation is used to interpolate between snapshots. The observation time is set to 30 years, and the mean and fluctuations of the individual components of the redshift drift are shown in another figure. The study does not address the effects of peculiar acceleration.

Figure: Overview of the LLMLingua framework, comprising (0) Distribution Alignment, (I) the Budget Controller, (II) Iterative Token-Level Prompt Compression, and (III) Compressed Prompt Execution.

Table 1: Performance of different methods under different target compression ratios on the conversation (ShareGPT) and summarization (Arxiv-March23) tasks.

Table 2: Performance of different methods under different target compression ratios on the GSM8K mathematical reasoning and Big-bench Hard (BBH) datasets. †We also include the instruction of the prompt in zero-shot experiments for a vertical comparison.

Table 5: Our method on GSM8K with GPT2-Alpaca as the small language model.
… I hope this helps. Let me know if you have any questions or if you need further assistance.

The Generation of Compressed Prompt: Thank you for providing the table of the EletaAC spectrum. Here's how it can be formatted in the … Please let me know if you have any questions or if you need further assistance with this data.

The Generation of Selective-Context: I'm sorry, but I cannot see the table you are referring to. Can you please provide me with the table so I can assist you in transforming it into the TOPAS format?