Adapting Language Models to Compress Contexts

Transformer-based language models (LMs) are powerful and widely applicable tools, but their usefulness is constrained by a finite context window and the high computational cost of processing long text documents. We propose to adapt pre-trained LMs into AutoCompressors. These models are capable of compressing long contexts into compact summary vectors, which are then accessible to the model as soft prompts. Summary vectors are trained with an unsupervised objective, whereby long documents are processed in segments and summary vectors from all previous segments are used in language modeling. We fine-tune OPT models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity. We evaluate AutoCompressors on in-context learning by compressing task demonstrations. We find that summary vectors are good substitutes for plain-text demonstrations, increasing accuracy while reducing inference cost. Finally, we explore the benefits of pre-computing summary vectors for large corpora by applying summary vectors to retrieval-augmented language modeling. Overall, AutoCompressors emerge as a simple and inexpensive solution for extending the context window of LMs while speeding up inference over long contexts.


Introduction
Transformer-based (Vaswani et al., 2017) language models (LMs) have recently seen a sharp rise in popularity and are now receiving millions of queries, processing billions of tokens, and generating text for a wide variety of applications (Brown et al., 2020; Touvron et al., 2023; Zhang et al., 2022). With this rise in popularity comes the challenge for researchers to make LMs more efficient, to speed up inference, and to deploy LMs at scale, while increasing their versatility, thus allowing users to process more data in new ways.

Figure 1: AutoCompressors process long documents by recursively generating summary vectors which are passed as soft prompts to all subsequent segments.
With these goals in mind, we propose to teach pre-trained LMs the ability to compress text into summary vectors. Summary vectors are short soft prompts (Lester et al., 2021), one or two orders of magnitude shorter than the plain text they compress, that are obtained from the output states of a language model. Summary vectors serve two general purposes: they can help extend the language model's context window to very long documents with minimal computational overhead, and they can speed up inference on text for which summary vectors have been pre-computed and cached.
Our models, which we call AutoCompressors, are trained with a simple unsupervised learning objective that encourages the model to store essential information in the summary vectors. Summary vectors are produced segment by segment from long documents and are used to improve language modeling in future segments (Figure 1). Our work builds on the recently proposed RMT architecture (Bulatov et al., 2022) with a crucial difference: we introduce summary accumulation, in which summary vectors from all segments are concatenated to produce the summary of the entire document. We also train AutoCompressors with randomly segmented inputs so they can better compress contexts of variable lengths in downstream tasks. We show that these innovations improve long-range information retention and enable new ways of reasoning over multiple passages.
AutoCompressors can be initialized with pre-trained LMs to produce powerful and versatile models. We fine-tune AutoCompressors from OPT-2.7B (Zhang et al., 2022) and Llama-2-7B (Touvron et al., 2023) on sequences from 6,144 up to 30,720 tokens with a single NVIDIA A100 GPU with 80GB of memory. We show that summary vectors are effective for improving perplexity over long documents and that these compression capabilities generalize to held-out domains. Our analysis suggests that AutoCompressors are able to reason over summary vectors, making them useful for a diverse set of downstream applications.
We apply AutoCompressors to in-context learning (ICL) by compressing up to 90 in-context demonstrations. We consider 11 classification tasks, including 7 SuperGLUE tasks (Wang et al., 2019), and we find that summary vectors outperform few-shot ICL with a comparable number of in-context tokens on 8 out of 11 tasks.
Finally, we explore two applications where AutoCompressors can reduce inference costs by pre-computing summary vectors for large corpora. First, we adopt a setting for retrieval-augmented language modeling (Shi et al., 2023). We find that for equal sequence lengths, using summary vectors achieves 1.5× the perplexity gains compared to plain-text passages, and outperforms retrieval-augmented methods for similar computational budgets. Secondly, we consider a zero-shot passage re-ranking task (Sachan et al., 2022). We establish that re-ranking passages based on their summary vectors achieves the best trade-off between re-ranking performance and inference throughput.
In summary, our main contributions are the following: (1) We introduce a method for extending LMs to long context windows under small-scale computational requirements by learning to generate summary vectors. We propose summary accumulation and training with randomized segmenting as key features of AutoCompressors. (2) We show that summary vectors encode useful information for downstream tasks and can be used to reduce the inference cost of in-context learning. (3) We demonstrate the benefits of pre-computing summary vectors for large corpora and of using AutoCompressors in conjunction with retrievers.

Related Work
Soft prompts Soft prompt tuning is an effective method to adapt pre-trained Transformers without updating existing parameters (Lester et al., 2021; Zhong et al., 2021; Liu et al., 2022). Newly initialized embeddings are prepended to the input sequence (the "soft prompt"), and optimization is performed with respect to these new parameters while the rest of the model is frozen. It is one of many parameter-efficient fine-tuning methods (Lialin et al., 2023) and is related to prefix tuning, where newly initialized parameters are prepended to the attention states instead (Li and Liang, 2021).

Prompt compression Wingate et al. (2022) propose to learn a soft prompt σ to compress the information contained in a context x. Given a pre-trained language model p_LM, they draw continuations y ∼ p_LM(· | x) based on x and use a distillation objective to align the model's predictions conditioned on the soft prompt, p_LM(y | σ), with its predictions conditioned on the context, p_LM(y | x). Wingate et al. (2022) find that soft prompts retain high-level information and facilitate controllable generation. However, the approach requires running the optimization for every new context x, with no knowledge transfer between similar contexts. In contrast, our AutoCompressors learn to predict their own soft prompts σ as a function of x.

Context distillation A related line of work (Askell et al., 2021; Snell et al., 2022) aims to distill in-context information, e.g., instructions, into an unprompted student model. In concurrent work, Mu et al. (2023) teach models to compress instructions into short key-value attention prefixes. Our approach differs by learning to compress any context information, including long documents, and results in more compact soft prompts.

Long-range Transformers A number of architectural modifications have been proposed to scale Transformers to longer context lengths while reducing the high memory cost of full attention. These include restricting and sparsifying the attention window (Dai et al., 2019; Child et al., 2019), approximating the attention (Rae et al., 2020; Zheng et al., 2022; Choromanski et al., 2021), as well as introducing recurrent elements (Ma et al., 2022; Bulatov et al., 2022), conditional computation (Ainslie et al., 2023), and retrieving previous tokens from the context at the output layer (Zhong et al., 2022). See Tay et al. (2022) for a comprehensive survey of efficient long-range architectures.

Most of these architectures require expensive training from scratch or deviate substantially from a pre-trained initialization. Moreover, many language models lack the inductive bias to extrapolate to longer sequences (Press et al., 2022). While AutoCompressors could in principle be trained from scratch, we show that they offer a straightforward solution for extending the context window of pre-trained models to longer sequences.

Method
We describe how we adapt a pre-trained language model to compress text into summary vectors. An overview of our architecture is shown in Figure 1.

Summary vectors
The AutoCompressor builds on the RMT architecture (Bulatov et al., 2022). We extend the input vocabulary of the base model by κ special summary tokens <Sum>_i and initialize κ new input embeddings. When we append the sequence <Sum>_1 . . . <Sum>_κ to an input, it signals to the model to output special summary vectors of the preceding context. These vectors can then be passed to the next text segment as a soft prompt of length κ. Since the embedding spaces of pre-trained language models can span thousands of dimensions, we expect this mechanism to have a high capacity for passing information to subsequent segments. Furthermore, a soft prompt can interpolate between many token embeddings, and can therefore represent more abstract concepts than a single discrete token (Wingate et al., 2022).

Summary accumulation We split long documents into segments S_1, . . ., S_n and process them sequentially. Bulatov et al. (2022) incorporate information from previous segments by prepending the compressed summary σ_{i−1} produced from S_{i−1} to the embedded inputs of S_i. We propose summary accumulation, which allows for a direct information pathway between each segment and all segments preceding it: we concatenate the summary vectors σ_1, . . ., σ_{i−1} to form σ_{<i} and prepend σ_{<i} to S_i. Note that the length of σ_{<i} is (i − 1)κ, which grows linearly with the document length.

Positional embeddings When using a base Transformer architecture with absolute positional embeddings, such as the OPT architecture (Zhang et al., 2022), we do not add positional embeddings to the summary tokens <Sum>_i, nor to the summary vectors. This allows us to use all pre-trained position embeddings for context tokens and makes it possible to scale the model to an arbitrary number of compression steps during training. The model still preserves the order of summary tokens due to their separate token embeddings.
If the base Transformer uses relative positional embeddings, such as RoPE (Su et al., 2022), we apply the positional embedding to the summary tokens and vectors without any further modification.
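To make summary accumulation concrete, the following is a minimal PyTorch-style sketch of the compression loop (illustrative only, not the released implementation). It assumes a Hugging Face OPTModel-style decoder whose forward accepts inputs_embeds and returns last_hidden_state, and it glosses over attention masks and the positional-embedding handling described above; compress_document and its arguments are hypothetical names.

```python
import torch

def compress_document(decoder, embed_tokens, summary_embeds, segments):
    """Sketch of summary accumulation (hypothetical helper, not the paper's API).

    decoder        : decoder whose forward(inputs_embeds=...) returns last_hidden_state
    embed_tokens   : token embedding layer of the base LM
    summary_embeds : (kappa, d) learned embeddings of the <Sum> tokens
    segments       : list of 1-D LongTensors with the token ids of S_1..S_n
    Returns sigma_{<n+1}, the accumulated summary vectors of shape (n * kappa, d).
    """
    kappa, dim = summary_embeds.shape
    accumulated = []  # [sigma_1, ..., sigma_{i-1}]
    for seg in segments:
        seg_embeds = embed_tokens(seg)  # (m_i, d)
        prefix = (torch.cat(accumulated, dim=0)
                  if accumulated else seg_embeds.new_zeros((0, dim)))
        # Prepend sigma_{<i} as a soft prompt and append the kappa <Sum> tokens.
        inputs = torch.cat([prefix, seg_embeds, summary_embeds], dim=0).unsqueeze(0)
        hidden = decoder(inputs_embeds=inputs).last_hidden_state  # (1, L, d)
        sigma_i = hidden[0, -kappa:]  # output states at the <Sum> positions
        accumulated.append(sigma_i)
    return torch.cat(accumulated, dim=0)
```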

Training Summary Vectors
We use a simple unsupervised training approach which encourages the model to learn to compress contexts over multiple steps.

Training objective Write (x^i_1, . . ., x^i_{m_i}) for the tokens of segment S_i, for every i ≤ n, where m_i is the number of tokens in S_i. Conditioning on the concatenated summary vectors σ_{<i}, we project the Transformer outputs with the language modeling head to obtain the next-token probabilities p(x^i_t | x^i_1, . . ., x^i_{t−1}, σ_{<i}). We minimize the cross-entropy loss over the entire document,

L = −(1/N) Σ_{i=1}^{n} Σ_{t=1}^{m_i} log p(x^i_t | x^i_1, . . ., x^i_{t−1}, σ_{<i}),

where N is the total number of tokens. This objective retains the pre-trained language model's abilities on the first segment S_1, and it incentivizes the model to store useful information in the summary vectors, which future segments can leverage to make better token predictions.
Unlike Wingate et al. (2022), we do not train with a knowledge distillation objective, since the pre-trained LM teacher has a limited context window, whereas the AutoCompressor student learns to process much longer documents.

Randomized segmenting We randomly vary the lengths m_i of the segments S_i during training, subject to the condition that each segment fits into the model's context window. This allows AutoCompressors to compress documents of different lengths and improves performance under evaluation with fixed-length segments (see Figure 2).

BPTT with stop-gradients We employ backpropagation through time (BPTT) and gradient checkpointing (Chen et al., 2016) for each segment to reduce the size of the computational graph. In addition, we compute and cache summary vectors and stop their gradients after 2 compression steps, similar to caching past attention states in Transformer-XL training (Dai et al., 2019). This assumes that, for learning to compress the useful information in S_i, it is sufficient to predict the tokens in the adjacent segment S_{i+1}. In Figure 2, we confirm that this incurs no penalty when predicting long segments, while further reducing GPU memory requirements.
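The following sketch puts the training-specific pieces together: random segment lengths and stop-gradients on summary vectors older than two compression steps. It is a simplified illustration under stated assumptions; lm.forward_segment is a hypothetical wrapper around the compression step sketched above, and the loss corresponds to the per-document cross-entropy objective.

```python
import random
import torch
import torch.nn.functional as F

def autocompressor_loss(lm, tokens, min_len=1024, max_len=2048, keep_grad_steps=2):
    """Illustrative training step for one long document (names are hypothetical).

    lm.forward_segment(segment_ids, soft_prompt) -> (logits, summary_vectors),
    where logits has one row per token of the segment.
    """
    # Randomized segmenting: cut the document into segments of varying lengths.
    segments, start = [], 0
    while start < tokens.size(0):
        length = random.randint(min_len, max_len)
        segments.append(tokens[start:start + length])
        start += length

    summaries, total_loss, total_tokens = [], 0.0, 0
    for seg in segments:
        soft_prompt = torch.cat(summaries, dim=0) if summaries else None  # sigma_{<i}
        logits, sigma_i = lm.forward_segment(seg, soft_prompt)

        # Next-token cross-entropy on the plain-text positions of this segment.
        total_loss = total_loss + F.cross_entropy(logits[:-1], seg[1:], reduction="sum")
        total_tokens += seg.size(0) - 1

        # BPTT with stop-gradients: keep gradients for the last `keep_grad_steps`
        # summaries only; older summary vectors are detached (cached).
        summaries.append(sigma_i)
        summaries = ([s.detach() for s in summaries[:-keep_grad_steps]]
                     + summaries[-keep_grad_steps:])

    return total_loss / total_tokens
```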

Language Modeling Evaluation
In this section, we train AutoCompressors and evaluate their long-range language modeling capabilities by sampling long sequences, which we split into segments of 2,048 tokens. We fix the final segment and compress the previous n segments. We track the perplexity of the final segment when conditioning on the summary vectors for each n.
We conduct our main experiments and ablations with OPT models (Zhang et al., 2022) of 1.3B or 2.7B parameters, fine-tuned on 2B tokens from the Pile (Gao et al., 2020). In Section 4.1, we evaluate an AutoCompressor on sequences of 8,000 tokens and compare it to an equivalent RMT model and an Extended Full Attention baseline. In Section 4.2, we fine-tune an AutoCompressor on sequences of 30,000 tokens to demonstrate the feasibility on very long sequences. Finally, in Section 4.3, we scale up AutoCompressors by fine-tuning a Llama-2-7B model on 15B tokens from RedPajama (TogetherAI, 2023). Full model hyperparameters and data information can be found in Appendix A.

Experiments on 8K-Token Sequences
Setting We initialize all models with the 2.7B-parameter OPT model and fine-tune on 2B tokens from 4 domains from the Pile (Gao et al., 2020). Our AutoCompressor uses κ = 50 summary tokens and is fine-tuned with summary accumulation over four segments, each ranging from 1,024 to 2,048 tokens. Compressing 2,048 tokens into 50 summary vectors corresponds to a compression rate of about 40 tokens per summary vector. We use the following baselines:

1. We fine-tune an OPT-2.7B baseline on our data. This model is limited to sequences of 2,048 tokens due to pre-training.

2. Extended full attention: We fine-tune OPT-2.7B on sequences of up to 4,096 tokens by extending the model's positional embeddings (see the sketch after this list). We initialize the embeddings for positions [2049..4096] with the embeddings for positions [1..2048]. We are not able to extend the context beyond 4,096 tokens due to GPU memory limitations.

3. RMT-2.7B: We fine-tune an RMT model on our data with κ = 50 summary vectors.
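As an illustration of baseline 2, one way to tile the pre-trained absolute position embeddings over a longer window is sketched below. This is our reading of the procedure, not the released code: it assumes the embeddings are stored as a plain nn.Embedding weight matrix and ignores the offset positions that OPT reserves internally; the attribute path in the usage comment is an assumption.

```python
import torch
from torch import nn

def extend_position_embeddings(old_emb: nn.Embedding, new_max_pos: int) -> nn.Embedding:
    """Initialize positions [2049..4096] with the weights of positions [1..2048]."""
    old_max_pos, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_max_pos, dim)
    with torch.no_grad():
        for start in range(0, new_max_pos, old_max_pos):
            end = min(start + old_max_pos, new_max_pos)
            new_emb.weight[start:end] = old_emb.weight[: end - start]
    return new_emb

# Hypothetical usage; the exact module path depends on the model implementation:
# opt.model.decoder.embed_positions = extend_position_embeddings(
#     opt.model.decoder.embed_positions, new_max_pos=4096)
```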
We evaluate on documents of 8,192 tokens, drawn from the 4 training domains or from 4 held-out domains. We generate summary vectors for up to 3 segments of 2,048 tokens, but also for single segments as short as 128 tokens. For the extended full attention baseline, we prepend the previous context tokens to the context window.

Results
We show the results in Table 1. We find that the AutoCompressor benefits from long contexts of 6,144 tokens and consistently outperforms the RMT model.
We also find that the AutoCompressor benefits from much shorter sequences than seen during training, unlike RMT. See also Figure 2 and Table 6 for the usefulness of randomized segmenting.
While extended full attention performs best on 4,096-token sequences, we observe a trade-off for shorter contexts, where AutoCompressors achieve the best performance. We also stress that the AutoCompressor attends to at most 150 additional summary vectors during evaluation, whereas the full attention model is given an additional 2,048 tokens.
These trends hold for both in-domain and out-of-domain evaluation. However, the gap between the AutoCompressor and the full attention baseline increases in the out-of-domain setting, suggesting that the summary vectors generalize slightly less well than pre-trained attention heads.

Experiments on 30K-Token Sequences
Setting We fine-tune OPT-1.3B and OPT-2.7B as AutoCompressors on 2B tokens, but train on sequences of 30,720 tokens with 20 compression steps. We use 50 summary tokens, randomized segmenting, and stop-gradients as before. We also fine-tune an RMT model from OPT-1.3B to use as a baseline. We are not able to fine-tune a 2.7B-parameter RMT baseline because the RMT method leads to an out-of-memory error.
All models are evaluated on the final 2,048 held-out tokens of documents of size 30,720 tokens, by compressing all previous 2,048-token segments.

Results
We collect our results in Table 2. The evaluation shows that both AutoCompressor models learn to utilize the entire 28K tokens of context to reduce perplexity, while the RMT baseline does not benefit from doubling the number of context tokens from 14K to 28K. This shows that summary accumulation effectively captures long-range dependencies in documents. We also report the CUDA memory requirements for fine-tuning each model in Table 2. We train with one NVIDIA A100 GPU with 80GB of memory. Stopping gradients reduces CUDA memory and makes it possible to fine-tune an AutoCompressor from OPT-2.7B, while fine-tuning with RMT leads to out-of-memory errors at that scale.

Scaling Up AutoCompressors to Llama-2
Setting We fine-tune a 7B-parameter Llama-2 model as an AutoCompressor on a single GPU by freezing the model and optimizing only the summary token embeddings and the attention weights via LoRA (Hu et al., 2022). The model is trained on 15B tokens from RedPajama (TogetherAI, 2023), split into sequences of 6,144 tokens, and we use 50 summary tokens, randomized segmenting, and stop-gradients. We also fine-tune an Extended Full Attention baseline on the same dataset; its context window is extended by increasing the θ value in RoPE, following Rozière et al. (2023).
We compare both models to the pre-trained Llama-2-7B model, which has a context window of 4,096 tokens. All models are evaluated on the final 2,048 tokens of 8,192-token documents.
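A sketch of this fine-tuning configuration is shown below, assuming the Hugging Face transformers and peft libraries. The RoPE θ value, LoRA rank, and target modules are illustrative placeholders rather than the paper's exact settings, and training the full (resized) embedding matrix is a simplification of training only the new summary-token embeddings.

```python
from transformers import AutoConfig, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"

# Increase the RoPE base frequency (theta) so the model can attend over a longer
# window, in the spirit of Rozière et al. (2023); the value here is illustrative.
config = AutoConfig.from_pretrained(base)
config.rope_theta = 100_000.0

model = AutoModelForCausalLM.from_pretrained(base, config=config)
model.resize_token_embeddings(model.config.vocab_size + 50)  # add 50 <Sum> tokens

# Freeze the base model; train only LoRA adapters on the attention weights and
# the (resized) token embedding matrix, which contains the new summary tokens.
lora = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```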

Results
We collect our results in Table 3. The AutoCompressor benefits from the entire context to reduce perplexity: compressing a 4,096-token context into 100 summary vectors achieves similar perplexity to the Extended Full Attention baseline with 512 plain-text tokens, and compressing a 6,144-token context into 150 summary vectors further improves perplexity slightly. Moreover, we find that summary vectors preserve perplexity when short contexts are compressed.
However, Llama-2 and the Extended Full Attention baseline outperform the AutoCompressor when longer contexts are provided. Further research is needed to construct summary vectors that preserve all of the context information.

Analysis
Ablations We train OPT-2.7B models without randomized segmenting, summary accumulation, or stop-gradients. The results are shown in Figure 2. We find that randomized segmenting leads to better compression of short segments, while still improving perplexity when compressing multiple 2,048-token segments. As expected, summary accumulation helps improve perplexity beyond one compressed segment. We also confirm that stopping gradients does not impact performance, while Table 2 shows that it reduces GPU memory requirements.
We also train AutoCompressors with κ = 20, 50, 70 or 100 summary tokens and report the held-out perplexity results in Table 7 in the Appendix. Surprisingly, we find that performance does not increase with longer soft prompts, and κ = 50 performs best overall. We hypothesize that learning a larger number of summary vectors may require a larger training budget.
Token-level analysis We seek to better understand how summary vectors benefit individual token predictions. In Figure 5 in the Appendix, we show perplexity gains at each token position for the AutoCompressor with summary vectors and for the extended full attention baseline.
We find that conditioning on summary vectors improves perplexity at all 2,048 token positions. We observe that the extended full attention baseline outperforms the AutoCompressor at the start of the sequence, whereas the AutoCompressor achieves the best performance towards the end of the sequence. This shows that summary vectors effectively capture long-range textual dependencies.
In Appendix D, we show examples of sentences and tokens which benefit the most from summary vectors. We find that summary vectors contain salient information, such as names or dates, and that the model can reason over summary vectors. This confirms that summary vectors are useful summaries of the compressed text.

Compressing Demonstrations for In-Context Learning

In this section, we study the usefulness of summary vectors for performing downstream tasks. We show that in-context demonstrations can reliably be compressed into summary vectors, improving performance while also increasing efficiency on a diverse set of NLP benchmarks.
Evaluation We evaluate the in-context learning abilities of the AutoCompressor based on Llama-2-7B from Section 4.3 on eleven classification and multiple-choice question-answering datasets. For each dataset, we evaluate the effect of compressing 1, 2 or 3 segments of demonstrations into 50, 100 or 150 summary vectors. For each segment, we include as many demonstrations as possible until we reach 750 tokens. For SST-2, this corresponds to 30 demonstrations per segment on average. We compare this compression approach with the results obtained by prompting the model with 150 and 750 tokens' worth of plain-text demonstrations.
We use contextual calibration (Zhao et al., 2021) and class-balanced sampling when these techniques improve performance on a validation set. For each dataset, we report the mean accuracy and standard deviation over 7 random seeds. The detailed settings for each dataset can be found in Table 11. In Table 12 in the Appendix, we also compare the ICL performance of our OPT-2.7B-based AutoCompressor models against the RMT baseline and a pre-trained OPT-2.7B, and include the performance of the pre-trained Llama-2-7B model.
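To illustrate how summary vectors are used at evaluation time, the sketch below compresses the demonstration segments once and reuses the resulting soft prompt to score the label options of each test example. The lm.compress and lm.logprob helpers are hypothetical stand-ins for the model calls sketched in Section 3; the efficiency gain comes from caching the soft prompt across the whole evaluation set.

```python
def classify_with_compressed_demos(lm, demo_segments, test_prompt_ids, option_ids):
    """Pick the label option with the highest likelihood given compressed demos.

    lm.compress(segments)                           -> summary vectors (soft prompt)
    lm.logprob(target_ids, prefix_ids, soft_prompt) -> float, sum of log-probs
    demo_segments : list of LongTensors, each holding ~750 tokens of demonstrations
    option_ids    : list of LongTensors, one per tokenized label verbalizer
    """
    # One-off cost: compress 1-3 segments of demonstrations into 50-150 vectors.
    soft_prompt = lm.compress(demo_segments)

    # Per-example cost: only the short test prompt plus the cached soft prompt.
    scores = [lm.logprob(opt, prefix_ids=test_prompt_ids, soft_prompt=soft_prompt)
              for opt in option_ids]
    return max(range(len(scores)), key=lambda i: scores[i])
```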

Results
We show evaluation results in Table 4. Summary vectors consistently improve performance over the zero-shot baseline. Furthermore, summary vectors increase accuracy compared to 150 tokens' worth of plain-text demonstrations on 8 out of 11 tasks. On 8 tasks (AG News, SST-2, BoolQ, WiC, WSC, CB, COPA and MultiRC), summary vectors also outperform ICL with 750 tokens' worth of plain-text demonstrations. Summary vectors emerge as a strong alternative to plain-text demonstrations, as they increase accuracy while reducing inference cost.

Table 4: Evaluation of the ICL performance of the Llama-2-7B model. Each summary is 50 tokens long and corresponds to a segment of 750 tokens' worth of demonstrations. We also report accuracies when prompting the AutoCompressor with 150 and 750 tokens' worth of plain-text demonstrations as baselines. Note that for BoolQ and MultiRC, demonstrations are too long to fit into 150 tokens.
In Table 12 (Appendix E), we find that the OPT-2.7B AutoCompressor achieves higher accuracy than the RMT baseline on 8 out of 11 tasks and that the RMT model does not benefit from multiple compression steps. This shows that summary accumulation is an effective mechanism for compressing in-context demonstrations. We also observe that our fine-tuned Llama-2 AutoCompressor has substantially worse zero-shot accuracy on some tasks compared to the Llama-2 initialization, and slightly worse ICL performance. We suspect that this is due to a domain mismatch between our fine-tuning data and the Llama-2 pre-training corpus.

Compressing Retrieval Corpora for Efficient Inference
We study the usefulness of pre-computing summary vectors for large collections of documents. These can be stored and later retrieved for efficient inference. Since inference is typically more expensive than storage, this approach has the potential to achieve good practical trade-offs.

Retrieval-augmented Language Modeling
Retrieval-augmented language models improve token predictions by retrieving information from a data store. A number of approaches have been proposed to infuse external knowledge at the input layer (Guu et al., 2020; Shi et al., 2023), in intermediate layers (Borgeaud et al., 2022), or at the output layer (Khandelwal et al., 2020; Zhong et al., 2022).

Figure 3: Efficient retrieval-augmented language modeling with AutoCompressors. Large corpora can be pre-processed into compressed summary vectors which can be stored cheaply. Upon retrieval, compressed summaries are fused for efficient access to multiple documents in a single forward pass.

REPLUG Our case study focuses on REPLUG (Shi et al., 2023), a simple method for combining a pre-trained language model with an off-the-shelf retriever to improve language modeling performance. Given access to an external corpus C, REPLUG retrieves k passages D = {d_1, . . ., d_k} based on a segment x to score the next segment y. The overall probability of y is computed by ensembling the predictions based on the different passages,

p(y | x) = Σ_{d ∈ D} λ(d, x) · p(y | CONCAT(d, x)),

where λ(d, x) are the normalized similarity scores from the retriever and CONCAT(d, x) denotes the concatenation of d and x. This method incurs a substantial overhead, since it requires k forward passes over the sequences CONCAT(d, x, y).

Fused Summaries We introduce a setting for retrieval-augmented language modeling close to fusion-in-decoder (Izacard and Grave, 2021). We concatenate the summary vectors of the retrieved passages D to form fused summary vectors, σ_D = CONCAT(σ_{d_k}, . . ., σ_{d_1}), where d_k, . . ., d_1 are ordered from least to most relevant. This resembles summary accumulation as described in Section 3, except that the retrieved summary vectors are produced independently rather than recursively.

Table 5: PPL gains (%) from different retrieval-augmented language modeling settings, over the no-retrieval baseline. We evaluate the OPT-2.7B AutoCompressor and report throughput on a single NVIDIA A100 GPU for each method without batching examples. Fused Summaries outperforms Fused Passages and REPLUG with 50-token passages. Moreover, Fused Summaries top-10 outperforms REPLUG top-2 with 512-token passages while also gaining a 1.7× throughput increase.
Moreover, Fused Summaries outperforms REPLUG top-2 with 512-token passages and sees a 1.7× throughput increase, which shows that the model benefits from the diversity of compressed documents. However, REPLUG top-10 outperforms Fused Summaries. We leave it to future work to explore how to produce higher-quality summary vectors that better utilize the compressed passages.
We note that fusing summary vectors is effective despite a mismatch with training, since at inference we draw independent summary vectors from separate documents. Furthermore, our AutoCompressor model is only ever trained to accumulate 3 sets of summary vectors, and yet it benefits from fusing the summary vectors of up to 10 documents.
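The following sketch shows Fused Summaries at inference time: pre-computed summary vectors for the retrieved passages are concatenated from least to most relevant and prepended as a soft prompt when scoring the continuation. The helpers mirror the hypothetical lm.logprob interface from the earlier sketches and are not the released API.

```python
import torch

def fused_summaries_score(lm, summary_store, retrieved, scores, x_ids, y_ids):
    """Score y given fused summary vectors of retrieved passages and context x.

    summary_store : maps a passage id to its cached (kappa, d) summary vectors
    retrieved     : passage ids returned by the retriever for context x
    scores        : retriever similarity scores aligned with `retrieved`
    """
    # Order passages least-to-most relevant so the most relevant summary
    # vectors end up closest to the current context x.
    order = sorted(range(len(retrieved)), key=lambda i: scores[i])
    sigma_D = torch.cat([summary_store[retrieved[i]] for i in order], dim=0)

    # One forward pass over [sigma_D ; x ; y], instead of k passes as in REPLUG.
    return lm.logprob(y_ids, prefix_ids=x_ids, soft_prompt=sigma_D)
```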

Unsupervised Passage Re-ranking
Finally, we consider the case study of passage re-ranking, in which a fast off-the-shelf retriever like BM25 retrieves a large set of candidate passages, and a more capable re-ranker refines the ranking to increase the rank of the most relevant passages.

Method Sachan et al. (2022) introduce an effective method for leveraging language models as re-rankers with no additional supervision or fine-tuning. Given a query q and a set of candidate passages {p_1, . . ., p_k}, the language model scores the likelihood of the query q conditioned on the prompt "Passage: {p_i}. Please write a question based on this passage." for each passage p_i, and re-ranks the passages based on these scores.

Experiments We consider the task of re-ranking BM25 passages on the NQ test set (Balachandran et al., 2021) and compare out-of-the-box AutoCompressors with 20 and 50 summary tokens to pre-trained OPT models from 125M to 2.7B parameters. We pre-compute summary vectors for 21M passages from a Wikipedia corpus (Karpukhin et al., 2020), which requires 2.1TB and 5.4TB of disk space in half precision for 20 and 50 summary vectors respectively. We measure the quality of the re-ranked results using Recall@20.

Figure 4: We compare AutoCompressors (squares) in an unsupervised passage re-ranking setting to pre-trained language models (circles). The number on each data point shows how many passages retrieved by BM25 are re-ranked, and the vertical axis shows the Recall@20 performance of the re-ranking system on the NQ test set. We consider the throughput on a single NVIDIA A100 GPU and assume that multiple queries cannot be batched. By leveraging pre-computed summary vectors for passages, AutoCompressors lead to re-ranking solutions that lie on the Pareto front of recall vs. compute.
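The sketch below shows how pre-computed summary vectors slot into this re-ranking recipe: each candidate passage is represented by its cached summary vectors rather than its text, and candidates are re-ranked by the likelihood of the query under the instruction prompt. The prompt wording follows Sachan et al. (2022); the lm.logprob helper and summary_store are hypothetical, as in the earlier sketches.

```python
INSTRUCTION = "Please write a question based on this passage."

def rerank_with_summaries(lm, tokenizer, summary_store, candidate_ids, query):
    """Re-rank BM25 candidates by log p(query | passage summary vectors, instruction)."""
    query_ids = tokenizer(query, return_tensors="pt").input_ids[0]
    instruction_ids = tokenizer(INSTRUCTION, return_tensors="pt").input_ids[0]

    scored = []
    for pid in candidate_ids:
        sigma = summary_store[pid]  # cached summary vectors replace the passage text
        score = float(lm.logprob(query_ids, prefix_ids=instruction_ids, soft_prompt=sigma))
        scored.append((score, pid))
    # Highest query likelihood first.
    return [pid for _, pid in sorted(scored, key=lambda t: t[0], reverse=True)]
```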

Results
The results are shown in Figure 4. We measure throughput for individual, un-batched queries on a single NVIDIA A100 80GB GPU and assume that the latency of loading summary vectors is negligible. Although the passages are only 100 words long, resulting in low compression rates, summary vectors substantially speed up inference while sacrificing less performance than smaller models. This leads to a Pareto-optimal trade-off between compute and performance and demonstrates that summary vectors often retain sufficient information from a passage to assess its relevance to a particular query.

Conclusion
We have introduced a training strategy for adapting pre-trained LMs into AutoCompressors, which recursively compress contexts into summary vectors. Our experiments indicate that summary vectors retain important contextual information, that they can encode in-context demonstrations, and that they can be used in retrieval settings. Summary vectors can also be pre-computed, cached, and re-used. This offers practical efficiency gains by reducing the size of the attention window. Significant future work remains in scaling AutoCompressors to bigger models and in improving the quality of summary vectors to further close the gap with full attention over long-range contexts.

Limitations
1. We only apply AutoCompressors to OPT models of up to 2.7B parameters and a Llama model of 7B parameters. Future work needs to establish how AutoCompressors perform for even larger models. As the summary vector dimension grows, there is promise for retaining more information per vector.

2. Our results suggest that summary vectors ignore some useful information that is accessible via full attention. Additionally, models do not always benefit from increasing the number of summary vectors. We suspect that the training signal for learning summary vectors efficiently might be limited by pre-trained models being very good at making predictions from the plain-text tokens in the current segment. Future work is needed to improve this optimization.

3. Summary accumulation still leads to quadratic complexity with an increasing number of segments, albeit at a much lower rate than full attention. Future work may explore ways to combine many summary vectors more efficiently.

B No-context Language Modeling
In Table 6, we verify that our fine-tuning strategy does not significantly affect the language modeling capabilities of the OPT AutoCompressors when no summary tokens are given. We find that the AutoCompressor performs slightly better than the RMT model and significantly better than the extended full attention model when no additional context is given. Moreover, the AutoCompressor almost matches the fine-tuned OPT-2.7B baseline, with perplexity increasing by less than 1%.

C AutoCompressor Ablations
We train OPT AutoCompressor models as in Section 4.1 while varying κ = 20, 50, 70, 100. In Table 7, we report the perplexity evaluation on documents of 8,192 tokens across all evaluation domains.

D Token-level AutoCompressor Analysis
In Figure 5, we plot the perplexity gains achieved by the OPT AutoCompressor and the extended full attention baseline from Section 4.1 over the pre-trained OPT-2.7B model. We plot the gains achieved by the AutoCompressor both without any additional context and with the summary vectors obtained from 2,048 compressed tokens.
Results show that the summary vectors help reduce perplexity over the entire 2,048-token segment. This shows that summary vectors do not only contain information which helps continue the previous sequence.
Figure 5 also shows that the extended full attention baseline benefits more from the additional 2,048 context tokens than the AutoCompressor at the start of the sequence, but that the AutoCompressor achieves stronger gains at the end of the sequence. This shows that summary vectors effectively capture long-range textual dependencies and that fine-tuning AutoCompressors produces more robust models than fine-tuning extended full attention models. In Tables 9 and 10, we give hand-picked examples of sequences from each evaluation domain, highlighting which tokens benefit the most from the compressed context. We compress the first 300 tokens in every document from the evaluation set and evaluate on the following 100 tokens. In the notation of Section 3.1, we measure the perplexity gain of each token as the improvement in its predicted probability when conditioning on the summary vectors.
For each example, we record the top 3-5 most improved token predictions. We find that the tokens which benefit the most from the summary vectors are often interpretable. Names of characters, dates, and locations are often copied through the summary vectors (see the examples for Wikipedia, FreeLaw, or HackerNews). We also find that the model is able to reason over the summary vectors, as the tokens which benefit the most are sometimes not explicitly present in the compressed context, but are closely associated with the domain of speech (see the examples for Books3, Gutenberg and YoutubeSubtitles). Finally, we find that summary vectors are often useful for continuing the previous sentence (see the GitHub example).

E In-Context Learning Details
In Table 12, we compile evaluation results for OPT-2.7B, Llama-2-7B, as well as our AutoCompressor and RMT models. We follow the GPT-3 prompt templates (Brown et al., 2020) and detail our evaluation setting for OPT and Llama-2 in Table 11.

F Fused Retrieval-augmented Language Modeling
Table 8: PPL gains (%) over the no-retrieval baseline for Fused Summaries with and without re-ranking. With re-ranking, we order the passages based on the ℓ2 distances between their summary vectors and σ_x before concatenating the summary vectors, whereas without re-ranking we use the retrieval scores from the Contriever model. Re-ranking consistently produces higher perplexity gains.
We provide details and ablations for our proposed REPLUG alternative. Inspired by fusion-in-decoder (Izacard and Grave, 2021), we fuse summary vectors or passages in a single forward pass.

Fused Summary Vectors The summary vectors of the retrieved passages D are concatenated in order of increasing retrieval scores to form fused summary vectors, σ_D = CONCAT(σ_{d_k}, . . ., σ_{d_1}). This resembles summary accumulation as described in Section 3, but differs in that the retrieved summary vectors were produced independently rather than recursively. Nevertheless, we find that AutoCompressors transfer well to this setting.

Furthermore, we find it beneficial to smooth the probabilities conditioned on the fused summary vectors with the unconditioned probabilities p(y | x). We also show that language modeling performance improves when D is re-ordered based on the smallest ℓ2 distance between the summary vectors {σ_{d_1}, . . ., σ_{d_k}} and σ_x. This incurs negligible overhead, since σ_x can be constructed during the same forward pass which computes p(y | x). The ablation for this is shown in Table 8.

Fused Passages As a baseline for fusing summary vectors, we concatenate the corresponding plain-text passages, D = CONCAT(d_k, . . ., d_1), and condition on the concatenated passages and x. Note that this approach is quickly limited by the size of the pre-trained language model's context window, especially when retrieving many long passages.

Table 12: We evaluate the following models on 11 in-context learning tasks: the OPT-2.7B AutoCompressor and RMT model, the Llama-2-7B AutoCompressor, and the respective pre-trained models. For each fine-tuned model, numbers in bold are the highest evaluation results using at most 150 additional tokens. When using summary vectors, the OPT-2.7B AutoCompressor outperforms the RMT model on 8/11 tasks. Moreover, the OPT-2.7B AutoCompressor benefits from multiple compression steps on most tasks, whereas the RMT model performs best without summary vectors on 7/11 tasks and benefits from 3-step summary vectors on none of the above tasks. The Llama-2 AutoCompressor achieves the absolute highest accuracy using summary vectors on 7/11 tasks. It also achieves the highest accuracy with summary vectors on 9/11 tasks when using at most 150 additional tokens.

Figure 2: Perplexity on 2,048 held-out tokens given different numbers of compressed tokens. Compression is performed on up to 3 segments of 2,048 tokens. Ablations show that the different components of our fine-tuning strategy help boost performance and that stop-gradients do not impact performance.

Figure 5: We plot the perplexity gain over OPT-2.7B for our AutoCompressor model and the 4096-token extended attention baseline. We track the perplexity at each token position in sequences of 2,048 tokens. The AutoCompressor model almost matches the strong extended-attention baseline at the start of sequences and outperforms it at the end of sequences.

Table 1: Held-out perplexity on 2,048 tokens, while varying the length of the preceding context (all experiments are based on OPT-2.7B models). For RMT and AutoCompressor, we condition on summary vectors. We also report the perplexity gains compared to the fine-tuned OPT baseline without extra context, which achieves 6.28 in-domain and 8.53 out-of-domain (gains shown in colored numbers). †: Although the extended full attention (Extended FA) achieves similar or slightly better perplexity, it uses up to 2,048 additional tokens and cannot extend further. In contrast, the AutoCompressor uses only 50 × 3 = 150 summary vectors to process 6,144 context tokens.

Table 2: Evaluation results for AutoCompressors trained on sequences of 30,720 tokens and evaluated on Books3 (in-domain) and Gutenberg (out-of-domain). We train with a single NVIDIA A100 GPU and report the CUDA memory required for fine-tuning with a single sequence per batch. AutoCompressors require less memory because we stop gradients after two segments.

Table 3: Evaluation results for our AutoCompressor trained from Llama-2-7B on sequences of 6,144 tokens. For the AutoCompressor, we condition on summary vectors. For Llama-2 and the Extended Full Attention (Extended FA) baseline, we condition on plain-text tokens.

Table 6: Held-out perplexity of all models on 2,048 tokens without summary vectors or additional context.