Parallel Context Windows for Large Language Models

When applied to processing long text, Large Language Models (LLMs) are limited by their context window. Existing efforts to address this limitation involve training specialized architectures and cannot easily be applied to off-the-shelf LLMs. We present Parallel Context Windows (PCW), a method that alleviates the context window restriction for any off-the-shelf LLM without further training. The key to the approach is to carve a long context into chunks ("windows"), restrict the attention mechanism to apply only within each window, and re-use the positional embeddings across the windows. Our main results test the PCW approach on in-context learning with models that range in size between 750 million and 178 billion parameters, and show substantial improvements for tasks with diverse input and output spaces. We show additional benefits in other settings where long context windows may be beneficial: multi-hop questions and retrieval-augmented question answering with multiple retrieved documents. Our results highlight Parallel Context Windows as a promising method for applying off-the-shelf LLMs in a range of settings that require long text sequences. We make our code publicly available at https://github.com/ai21labs/parallel-context-windows.


Introduction
A key parameter of a Large Language Model (LLM) is its context window, the number of text tokens it can process in a forward pass. Current LLM architectures limit the context window size (typically to several thousand tokens) because the global nature of the attention mechanism imposes computational costs quadratic in context length. This presents an obstacle to use cases where the LLM needs to process a lot of text, e.g., tackling tasks that require long inputs (Tay et al., 2020; Shaham et al., 2022), considering large sets of retrieved documents for open-book question answering (Karpukhin et al., 2020; Levine et al., 2022a,b), or performing in-context learning (Brown et al., 2020) when the desired input-output relationship cannot be adequately characterized within the context window.
Previous work has addressed such obstacles by training dedicated architectures, e.g., training sparse attention mechanisms for long inputs (Zaheer et al., 2020; Guo et al., 2021) and Fusion-in-Decoder readers for retrieved documents (Izacard and Grave, 2020). However, these architectures are often tailored to specific use cases and are often constrained in size as a tradeoff, in order to facilitate long text consumption. It remains an open problem to find an effective way to allow an off-the-shelf LLM to process text longer than its original context window, without dedicated training.

In this paper, we introduce Parallel Context Windows (PCW), illustrated in Figure 2, a new approach for addressing this problem in any decoder-based LLM, and show its efficacy in several setups. PCW involves splitting long text into multiple parallel contexts, each equally accessible during output generation. Doing so consists of two simple post-hoc modifications to a pretrained LLM, neither of which requires any further training: (1) using sparse masking to allow each context window to attend only to itself, while still allowing the generated text to attend to all contexts simultaneously; and (2) re-using the model's learned positional embeddings within each parallel context window, sidestepping the problem of extrapolating positional embeddings and signaling to the model that each window is equally "close" to the generated tokens.
We conducted an in-depth investigation of the extent to which Parallel Context Windows can improve LLMs' ability to perform in-context learning (Brown et al., 2020): when a pretrained LLM is given an input sequence of concatenated "training" input-output pairs representing a task, followed by a single "test" input, it is able to supply the corresponding test output with high accuracy. Crucially, in the setting of in-context learning, the context window limitation inherently caps the number of training examples that can be inserted before the test example. This significantly limits the applicability of in-context learning for tasks with long or highly diverse inputs or outputs.
We focus on these types of tasks, showing that Parallel Context Windows significantly aid in-context learning of two task families that tend to suffer from low in-context learning performance: classification tasks that have many classes and extractive question answering tasks. We experiment with Jurassic-1 models (Lieber et al., 2021) having between 7B and 178B parameters and GPT-2 models (Radford et al., 2019) having between 750M and 1.5B parameters. Notably, using 3 Parallel Context Windows leads to average performance gains of 6.7, 7.3, and 7.9 points in the in-context learning scores of classification tasks with over 5 classes for Jurassic-1 models of sizes 7B, 17B, and 178B, respectively (see example in Figure 1). Our results show that Parallel Context Windows broadens the scope of tasks that can be learned via the popular setup of in-context learning, to tasks that require more training examples than permitted in current context sizes.

[Figure 2: The (i, j) cell in the matrix is colored iff the i-th token can attend to the j-th token. Each context window (in grey) attends to itself and is assigned positional embeddings (p_i) independently, thus re-using the positional vectors. Task tokens (in blue) attend to all the windows. PCW makes the attention matrix sparser, effectively parallelizing the processing of multiple windows.]
We further explore the applicability of PCW to two other settings that may benefit from the integration of several documents. One is multi-hop question answering, where the different pieces of information are shown in different windows. We show that in some cases parallel reading is beneficial, through a test case on the HotpotQA benchmark (Yang et al., 2018). The other setting is retrievalaugmented question answering, where we show that reading several retrieved documents in parallel is advantageous, through a test case on the Natural Questions benchmark (Kwiatkowski et al., 2019).
Overall, we provide clear evidence that, without any further training, Parallel Context Windows can make a large amount of text accessible to an off-the-shelf LLM during decoding. We thus see promise in further investigation of Parallel Context Windows for applying off-the-shelf LLMs in other applications that require such capabilities, such as tackling tasks with long inputs.

Parallel Context Windows
This section provides the details of our Parallel Context Windows method. The high-level idea of PCW is to insert a long input sequence into multiple replicas of the LLM's original context window, and to allow a small number of tokens at the end of the sequence to attend to all of the context windows simultaneously. We design PCW so that the modifications made to the off-the-shelf LLM are minimal, such that processing long contexts remains effective even without further training of the LLM. A side advantage is that the LLM modifications required for PCW are quite simple to implement. Specifically, PCW applies modifications to two mechanisms in common autoregressive LLMs: the positional embeddings (Section 2.1) and the attention mask (Section 2.2). Figure 3 illustrates both changes.

Positional Embeddings Modification
Denoting the LLM's original context window size by $N$ and the Transformer's input representation dimension by $d$, Transformer-based LLMs receive information about the input text ordering via a set of $N$ positional embeddings $\{p_i \in \mathbb{R}^d\}_{i=1}^{N}$, adding $p_i$ to the input token embedding in position $i$.
We conceptually divide the tokens at the input of the LLM into context tokens and task tokens. The context tokens are inputs that assist the LLM with a given task, such as in-context examples, or relevant retrieved documents. Task tokens refer to the input of the test example, e.g., a sentence to be classified or a question.
When considering a task that requires $T$ task tokens to formulate, the fact that there are only $N$ trained positional embeddings implies that effectively only $C = N - T$ input tokens can be processed as context. In order to implement PCW, we expand the number of processable context tokens by a factor of $B$, such that the overall input sequence can include $B \cdot C + T$ tokens. In order to allow LLMs to process this long sequence of text, we assign one of the $N$ learned positional embedding vectors to each location $i \in \{1, \ldots, B \cdot C + T\}$ by the following mapping (depicted in Figure 3):

$$\tilde{p}_i = \begin{cases} p_{((i-1) \bmod C) + 1} & 1 \le i \le B \cdot C \\ p_{i - (B-1) \cdot C} & B \cdot C < i \le B \cdot C + T \end{cases}$$

In words, via this mapping, the model effectively sees $B$ replicas of the first $C$ original positional embeddings, and the $T$ task tokens retain the last $T$ positional embeddings, now seeing these $B$ replicas as context in their near past. We refer to these replicas of the positional embeddings as context window replicas. Notably, while the above re-use of the positional embeddings assigns meaningful positions to all tokens within the longer input sequence, the memory cost of full attention over this expanded sequence is quadratic in its length, and moreover, the model was not trained to have two tokens in the same position attend to each other. To address both issues, we describe below a modification to the LLM's attention mechanism.
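The mapping above can be sketched in a few lines of code. This is a minimal illustration under our own naming (not the released implementation), using zero-based position ids where the text above is one-indexed:

```python
def pcw_position_ids(B: int, C: int, T: int) -> list:
    """Return a position id in [0, C + T) for each of the B*C + T input tokens.

    Each of the B context window replicas re-uses position ids 0..C-1;
    the T task tokens get the last T position ids, C..C+T-1, exactly once.
    """
    ids = []
    for _ in range(B):               # B replicas of the first C position ids
        ids.extend(range(C))
    ids.extend(range(C, C + T))      # task tokens keep the final T ids
    return ids

# e.g. B=3 windows of C=4 context tokens plus T=2 task tokens:
# [0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3,  4, 5]
```

Setting B = 1 simply yields the usual position ids 0..C+T-1 of the vanilla model.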

Attention Mask Modification
We impose a restriction on the attention mechanism which implies that tokens within each context window replica perform autoregressive attention to other tokens in their context window replica, and do not attend to tokens in other context window replicas. In contrast, the task tokens attend to context tokens within all context window replicas.
In the above setting of context window size $N$, we represent attention restrictions by attention mask scores $a_{i,i'} \in \{0, 1\}$ for $i, i' \in [N] := \{1, \ldots, N\}$. If $a_{i,i'} = 0$ then for any Transformer layer in the LLM, the token in input location $i$ cannot attend to the token in input location $i'$, and if $a_{i,i'} = 1$ it can. In common autoregressive LLMs, a token can only attend to tokens that precede it, which in the above notation translates to $a_{i,i'} = 1$ if $1 \le i' \le i \le N$ and $a_{i,i'} = 0$ otherwise.
For the case of PCW, the $B$ parallel context windows include tokens in positions $i \in [C]$, and are identified with an index $b \in [B]$. The $T$ task tokens are not parallelized, and are located in positions $i \in \{C+1, \ldots, C+T = N\}$. For completeness of the notation, we assign a dummy context window index $b = B+1$ to the $T$ task tokens. We add a second index to the attention scores: $a^{b,b'}_{i,i'} \in \{0, 1\}$ for $i, i' \in [N]$ and $b, b' \in [B+1]$. Similarly to the above, if $a^{b,b'}_{i,i'} = 0$ then for any Transformer layer in the LLM, the token in input location $i$ of context window $b$ cannot attend to the token in input location $i'$ of context window $b'$, and if $a^{b,b'}_{i,i'} = 1$ it can.
With the above notation in place, the following restriction implies that context tokens perform autoregressive attention within each context window replica (illustrated in Figure 3), for $b, b' \in [B]$ and $i, i' \in [C]$:

$$a^{b,b'}_{i,i'} = \begin{cases} 1 & b = b' \text{ and } 1 \le i' \le i \le C \\ 0 & \text{otherwise} \end{cases}$$

The following implies that the $T$ task tokens attend to all tokens in all $B$ context windows (for $i > C$ and any $b' \in [B+1]$):

$$a^{B+1,b'}_{i,i'} = \begin{cases} 1 & 1 \le i' \le i \\ 0 & \text{otherwise} \end{cases}$$

The above attention masks allow the model to attend to $B$ times more context when decoding the output, while keeping the computational cost linear in the number of parallel contexts $B$. Overall, for both of the above PCW modifications, setting $B = 1$ recovers the vanilla LLM mechanism.
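The two restrictions can be combined into a single mask over the flattened input of B·C + T tokens. The following is an illustrative construction (not the released code) as a boolean matrix whose (i, j) entry says whether token i may attend to token j:

```python
import numpy as np

def pcw_attention_mask(B: int, C: int, T: int) -> np.ndarray:
    """Build the PCW attention mask for B windows of C tokens plus T task
    tokens, over the flattened sequence of length B*C + T."""
    L = B * C + T
    mask = np.zeros((L, L), dtype=bool)
    # Context tokens: causal attention, restricted to their own window.
    for b in range(B):
        s = b * C
        for i in range(C):
            mask[s + i, s : s + i + 1] = True
    # Task tokens: causal attention over all windows and earlier task tokens.
    for i in range(B * C, L):
        mask[i, : i + 1] = True
    return mask
```

With B = 1 this reduces to the standard lower-triangular causal mask, matching the observation above that B = 1 recovers the vanilla mechanism.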

Experimental Setup
We apply the PCW method in the setting of in-context learning (ICL): we distribute the in-context training examples among the multiple context window replicas, thus allowing the test example to attend to more training examples. For each experiment, we report the performance with regular ICL, using the maximum number of examples that fit in a model's context window (n_max). For our PCW method, given B parallel windows, we effectively use B × n_max training examples. The n_max used for each dataset and model can be found in Table 9. Unless stated otherwise, we report results with B = 3 in the main paper, and discuss the choice of B in Appendix C. Since training examples vary in length, we allocate in-context examples to the parallel windows in a manner that balances the windows' lengths. The test example (corresponding to the T task tokens in Section 2) receives the positional embeddings that immediately follow the longest context window.
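As a concrete illustration of length-balanced allocation, the following greedy heuristic assigns an equal number of examples per window while keeping total lengths close. This is a hypothetical variant for illustration; the exact balancing procedure we used is described in Appendix A:

```python
def balance_windows(lengths, B, n_per_window):
    """Greedily assign example indices to B windows of n_per_window examples
    each: longest examples first, each placed into the currently shortest
    window that still has room. Returns a list of B lists of indices."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    windows = [[] for _ in range(B)]
    totals = [0] * B
    for i in order:
        # pick the shortest window (by total token length) with remaining capacity
        b = min((b for b in range(B) if len(windows[b]) < n_per_window),
                key=lambda b: totals[b])
        windows[b].append(i)
        totals[b] += lengths[i]
    return windows
```

For example, with example lengths [10, 9, 2, 1] and B = 2 windows of 2 examples each, both windows end up with a total length of 11.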
Training and test sets   The performance of in-context learning has been shown to vary significantly with the choice of training examples (Zhao et al., 2021). We followed past work (Zhao et al., 2021; Lu et al., 2021), randomly sampling 30 sets of training examples from the full training set. We report the mean and standard deviation of performance metrics across these samples. When comparing the PCW method with standard ICL, statistically significant differences according to a t-test (p-value < 0.05) are marked with *. To allow for an extensive set of experiments, we followed prior work and randomly subsampled the test sets to contain at most 250 examples (Zhao et al., 2021; Lu et al., 2021; Han et al., 2022).

Datasets   Our main focus is classification, and we experiment with 15 different datasets in this category, listed in Appendix B. Many of these datasets are used in prior work on in-context learning (Zhao et al., 2021; Lu et al., 2021; Han et al., 2022). We additionally experiment with several datasets with a high number of output classes (up to 150), to examine how well our approach works in this setting. To classify an example in the in-context learning setup, we assign the label using restrictive greedy decoding (see Appendix A). We also experiment with another type of task, information extraction, and test 4 datasets with a subset of the models (J1-Large and J1-Grande). For these tasks we use greedy decoding at temperature 0 (as in Zhao et al. (2021)). For further information about the decoding and formats used for the different types of datasets, see Appendices A and B.

Classification Tasks Results
PCW enables in-context learning with a large number of classes. Table 1 shows the results on various classification tasks, organized by the number of classes. With a small number of output classes (≤ 5), we find small or insignificant differences between PCW and vanilla ICL on J1-Large (7.5B), while with J1-Grande (17B) and J1-Jumbo (178B), PCW is superior in the majority of cases. However, many of these differences are not statistically significant.
Our PCW method shines in classification tasks with a large number of output classes. With more than 5 classes, PCW statistically significantly outperforms ICL in nearly all models and datasets. The average improvement across these datasets is 6.7, 7.3, and 7.9 points for J1-Large, J1-Grande, and J1-Jumbo, respectively. Evidently, the larger the model, the greater the benefit from our method. This positive scaling behavior of PCW stands in contrast to prior work attempting to improve ICL (Zhao et al., 2021; Lu et al., 2021; Han et al., 2022), where improvements to 178B-scale models were smaller than improvements observed in smaller models.

[Figure 4: Each data point represents the average gain across all datasets and J1 models. There is a strong positive correlation between the number of unique labels and the gains from PCW.]
In Table 5 (Appendix D.1), we report results with GPT-2 models. Although they are smaller than J1 models, we find consistent statistically significant improvements with GPT2-XL (1.5B parameters) in almost all datasets. With GPT2-Large (0.75B), we find improvements in the majority of datasets.
PCW improves with more classes. To examine the relation between the number of output classes and the performance of PCW, we compute the difference between PCW and ICL in each experiment, and average over all datasets (and models) having the same number of classes. As Figure 4 shows, there is a strong positive correlation between the number of classes and the improvement brought about by PCW (Pearson correlation r = 0.93 between the log-number of classes and the average improvement; the slope is 3.02). The gains are largest for datasets with dozens of unique labels, such as Banking77. To the best of our knowledge, existing work has not considered datasets with such a large number of classes, perhaps due to the standard limitation of the context window size.[4] We note that for GPT-2 models (Table 5, Appendix D.1) we do not see a significant correlation between PCW improvements and the number of classes, but these smaller models tend to struggle with very large numbers of classes.
Comparing results for datasets with different numbers of output classes may be confounded by other factors, such as differences in domain, style, or genre. To isolate such effects, we compare results on two datasets, each having both fine-grained and coarse-grained labels: (1) the TREC dataset (Li and Roth, 2002), which has 6 coarse-grained and 50 fine-grained classes; and (2) NLU (Xingkun Liu and Rieser, 2019),[5] which has 18 scenarios and 68 intents. From Table 1, we see that PCW outperforms standard ICL by 2.6 and 8.1 points on TREC coarse-grained and fine-grained classification, respectively. Similarly, on NLU coarse- and fine-grained classification, we see average improvements of 2.5 and 9.0 points, respectively. We conclude that our approach shines especially when dealing with a large number of output classes.

[4] An exception is Alex et al. (2021), who evaluated GPT-3 on Banking77 in a limited setting, but obtained poor results.
[5] Note that the NLU dataset is also misleadingly known as HWU64; see the Huggingface dataset page for more details.
PCW makes in-context learning more stable. A known limitation of in-context learning is its high variance across runs and its sensitivity to aspects such as the order of examples (Lu et al., 2021). Encouragingly, we find that PCW reduces this variance: we observe average standard deviations of 3.1, 2.3, and 2.6 for J1-Large, J1-Grande, and J1-Jumbo with PCW, compared to 3.9, 3.4, and 3.9 with standard ICL.


Information Extraction Results

Table 2 shows the results of ICL and PCW on information extraction datasets, with tasks like airline name extraction and extractive question answering. These tasks can be viewed as classification tasks with an extremely large number of classes, potentially spanning the entire vocabulary or all phrases over the vocabulary. Our approach consistently improves results with both J1-Large and J1-Grande, yielding statistically significant improvements in almost all cases. We also observe smaller standard deviations with PCW compared to ICL.

It is worth noting that prior work has not experimented much with information extraction in an in-context learning setting. Zhao et al. (2021) reported results with several datasets, but not with extractive question answering. Our approach seems to enable in-context learning in such cases as well.

PCW for Question Answering
In this section, we explore potential uses of PCW in settings other than in-context learning. Specifically, we examine two question-answering settings where PCW is expected to help aggregate information from multiple texts. First, we consider question answering based on retrieved documents. Second, we experiment with multi-hop reasoning, where the model is required to utilize more than one text when answering a question. Importantly, while in Section 3 the parallel context windows were used for processing training examples for ICL, in this section the windows are used for parallel processing of documents related to the test example.

Retrieval Based Question Answering
Setup   We first experiment with Natural Questions (NQ, Kwiatkowski et al., 2019) in an open-book question-answering retrieval setting: given a question and a set of candidate documents, which may or may not contain the evidence for the question, a model needs to generate a free-text answer.
In the single context window setting (the baseline), we followed the few-shot setup defined by Lazaridou et al. (2022): for each question, we retrieved evidence documents from Wikipedia, using a BM25 sparse retriever (Robertson et al., 2009). We then prompted the model with in-context training examples of the related task of extracting the answer from a gold evidence document, and concatenated the test question and N ∈ {1, 2, 4, 6, 8, 10} evidence documents. To fully utilize the context window size, we "padded" the prompt with as many in-context training examples as possible. For PCW, we followed the setup of a single window while taking advantage of the method's natural ability to parallelize: we increased the number of retrieved documents per question, and divided them between windows. E.g., for N = 1 and 3 parallel context windows (B = 3), PCW processes B × N = 3 retrieved documents (1 per window), thus effectively increasing the chance that the correct answer span will be shown to the model in one of the retrieved documents. The metric we used was the standard Exact Match (EM). We refer to Appendix A for more details.
Results   Figure 5 shows the results for J1-Grande when using PCW, compared to the baseline, as a function of the number of candidate documents in a single window. In all cases, PCW performs better than the baseline, demonstrating the benefit of parallel processing of candidate documents. As we increase the number of available retrieved documents, we see an increase in performance for both approaches. A similar trend can be seen for J1-Large (see Figure 6 in the Appendix). Naturally, performance on this task depends on the probability of retrieving the correct answer, which increases in the PCW setting, where the number of processed documents is multiplied by B = 3.

Multi-hop Question Answering
Setup   Finally, we experiment with HotpotQA (Yang et al., 2018), which requires multi-hop reasoning. Given a question and 10 evidence documents (2 gold and 8 distractors), answering the question requires reasoning over both gold documents. HotpotQA includes two question types: (a) questions that refer to a bridge entity; for example, to answer the question "When was the singer of Radiohead born?", one needs to reason that the singer is "Thom Yorke" (the bridge entity) and then find his birthday; and (b) questions that rely on a comparison between two entities, for example: "Who has played for more NBA teams, Michael Jordan or Kobe Bryant?". As a baseline, we provide all of the evidence documents sequentially, in random order. For PCW, we use 5 windows, with 2 evidence documents in each window. Since the 10 documents filled most of the context window of the J1 models, we work in a zero-shot setting. The evaluation metric is the standard Exact Match (EM).
Results   Table 3 shows the results, broken down by the bridge and comparison question types. Interestingly, PCW helps with comparison questions, improving performance over the baseline for both J1-Large and J1-Grande, while degrading performance on bridge questions. This disparate behavior can be explained by the kind of processing required to answer the two types of questions. In comparison questions, the model can extract the necessary information from the two gold texts independently, making them suitable for PCW. For example, to know who played for more NBA teams, the LM needs to extract the number of NBA teams Jordan played for from one text, while extracting the number of NBA teams Bryant played for from another, independent text. In contrast, to answer a bridge question, the processing of each text is conditioned on the other text: when reading a sentence about Thom Yorke's birthplace, we already need to know that Yorke is the Radiohead singer if we wish to then be able to answer the above question. This makes PCW less suitable for these types of tasks in its current form, and we leave how to encode sequential relations between windows (perhaps by some further training) as an open direction.

Improving In-Context Learning

Min et al. (2021) proposed a noisy channel approach to boost few-shot performance. Our framework is orthogonal and thus complementary to these methods, as we are mainly focused on how to increase the number of examples shown to the model. Our approach is also more general, as it seamlessly supports generative tasks as well.

Expanding the Context Window
The issue of a limited context window has been the focus of many studies that try to alleviate the memory footprint of self-attention. Ivgi et al. (2022) suggest SLED, an encoder-decoder model for long texts, which encodes short overlapping chunks of the input text and fuses the information in the decoder, a la Fusion-in-Decoder (Izacard and Grave, 2020). Similarly to our approach, both Izacard and Grave (2020) and Ivgi et al. (2022) employ off-the-shelf architectures, but their methods require further training. Among all the mentioned methods, our work is the first that utilizes existing LLMs for longer inputs without any further training.
In concurrent work, Hao et al. (2022) suggest using multiple context windows, while scaling the context tokens' attention weights. We show that large gains can be made without scaling the attention weights, and we demonstrate particularly large gains for tasks with diverse output spaces.
Moreover, they focus on LLMs with non-learned positional encodings (sinusoidal, Vaswani et al., 2017, and ALiBi, Press et al., 2022) and only show results in the ICL setting. In contrast, we show that PCW is effective for the more common LLMs that have learned positional embeddings, and that PCW obtains gains both in ICL and in document retrieval settings.

Conclusion and Future Work
In recent years, a multitude of successful approaches have been proposed for allowing Transformer-based language models to leverage large amounts of text during inference, leading to a variety of dedicated architectures. In parallel, however, the mainstream LLM production line of new models with "regular" context window sizes (up to several thousand tokens) enjoys faster progress in the form of scaling, innovation, and data updates. This paper introduced Parallel Context Windows (PCW): a simple approach for allowing any off-the-shelf LLM to broaden the scope of text it can access during inference. We showed the effectiveness of PCW in the framework of in-context learning, where access to a context that is larger by a factor of B implies learning from B times more training examples. Our results show that PCW is more effective than the vanilla single-context-window approach for in-context learning over a broad set of multi-class classification tasks, suggesting that PCW could improve in-context learning in tasks with diverse input or output spaces. We also showed promising signals for applying PCW to reading multiple retrieved documents.
Two key directions of future work strike us as particularly promising. First, by demonstrating that an off-the-shelf LLM can attend to substantially larger quantities of text via PCW, our results motivate further investigation of the PCW method in other settings in which it would be desirable to apply mainstream LLMs over long text sequences. Second, though our results suggest that PCW is effective without further training, we believe that further (short) training of an LLM with parallel context windows could further enhance the abilities demonstrated in this work.

Limitations
We presented Parallel Context Windows (PCW), a simple approach that alleviates context window restrictions for any off-the-shelf LLM, without additional training. We showed the potential of this method on a variety of models and datasets. That said, our method does have some limitations.
The number of context windows is limited and needs to be predetermined. As with vanilla in-context learning, the number of examples to include in the prompt must be selected beforehand. For PCW, the number of context windows, B, must also be selected. In this paper, most of the results are for B = 3. We experiment with the choice of B in Appendix C; the results are task dependent, but at a high level we find diminishing returns for B in the range of 5 to 7. We leave further investigation of how to effectively benefit from more windows to future work.
Not effective for all types of tasks. As discussed in Section 3, PCW shows impressive gains in ICL for tasks such as multi-class classification as well as information extraction. However, for some tasks, PCW does not improve performance. This might indicate that some tasks are not suited for parallel processing. Section 4.2 demonstrated that PCW is more suitable for cases where the input text can be divided into a few independent inputs, but it remains an open question whether other tasks, such as long text generation, would benefit from PCW.

A.1 PCW Implementation Details
Handling context windows of various lengths   Section 2 describes the PCW method for cases where each window has the same number of tokens. In practice, this was rarely the case in our experiments. We considered two variations of PCW to handle unequal windows. The first choice was whether to use left or right indentation of the windows, i.e., whether all of the windows should begin or end at the same position id. To avoid any discontinuity in the assignment of position ids, it is also possible to pad the windows with dummy tokens (e.g., a newline). Left indentation was found to be the preferred option in the ICL setting, while padding did not appear to have a significant effect. For that reason, and given the simplicity of this solution, we chose to use left indentation in all of our experiments. It is important to note that in the PCW implementation, all the windows and the task tokens attend to a single shared BOS token. We found that having multiple BOS tokens negatively affected our results.
Splitting the inputs into windows   For the experiments described in Section 3, we assigned an equal number, n_max, of samples per window, and only attempted to balance the lengths of the windows by greedily switching long and short samples between windows. n_max was calculated according to the following formula:

$$n_{\max} = \left\lfloor \frac{N - T_{\max}}{D_{90}} \right\rfloor$$

where $N$ is the context window size, $T_{\max}$ is the length of the longest test sample, and $D_{90}$ is the 90th percentile of the train samples' lengths. To avoid unwanted effects due to outliers, we removed the longest percentile of train and test samples.
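The computation above can be sketched as follows (the function name is ours, and the percentile computation is a simple nearest-rank approximation):

```python
import math

def compute_n_max(context_size, test_lengths, train_lengths):
    """Number of in-context examples per window: the context budget left
    after the longest test sample, divided by a high-percentile example
    length so that almost all sampled example sets fit."""
    train = sorted(train_lengths)
    t_max = max(test_lengths)
    d90 = train[int(0.9 * (len(train) - 1))]   # ~90th-percentile train length
    return math.floor((context_size - t_max) / d90)
```

For instance, with a 2048-token context, a longest test sample of 48 tokens, and a 90th-percentile train length of 100 tokens, this yields 20 examples per window.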
In the experiments described in Section 4.1, we divided the documents according to the retriever's ranking, so that the last document in each window would have the highest ranking in the window. It should be noted that the training examples were not parallelized. The same randomly chosen examples were used for both baseline and PCW, and new examples were drawn for each test sample. For the experiment described in Section 4.2, the division between windows was random.
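One simple way to realize this ranking-aware division (an illustrative assumption; the exact split may differ) is to deal the ranked documents round-robin across windows, then reverse each window so its best-ranked document comes last, closest to the question:

```python
def split_by_rank(docs_ranked_best_first, B):
    """Deal ranked items round-robin into B windows, then order each window
    worst-to-best so the highest-ranked item is last in its window."""
    windows = [docs_ranked_best_first[b::B] for b in range(B)]
    return [list(reversed(w)) for w in windows]
```

E.g., ranks 1..6 (1 = best) over B = 3 windows yield [4, 1], [5, 2], [6, 3]: every window ends with its best-ranked document.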

A.2 Evaluation Details
Classification   A common way to evaluate models in the in-context learning setup is to iterate over all possible labels for each test sample and check which label receives the highest probability according to the LM. This approach is problematic when a large number of classes is present, especially when some class names are split into multiple tokens. To save computational costs, we implemented constrained greedy decoding, at each step allowing only tokens that could result in a valid label. It is important to acknowledge that this evaluation method could result in slightly different performance for both the ICL baseline and the PCW approach. However, since most of the labels contain only a few tokens in both J1's and GPT's tokenizers, and the first token is usually quite indicative of the nature of the label, this effect should be minor.
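A minimal sketch of such constrained decoding over a fixed label set follows. Here `score_next` stands in for the LM's next-token score and `constrained_greedy` is our illustrative name, not the actual implementation; labels are given as token-id tuples:

```python
def constrained_greedy(labels, score_next):
    """Greedily decode one of `labels` (each a tuple of token ids), at each
    step masking the vocabulary to tokens that can still complete some label.
    Note: if one label is a strict prefix of another, the shorter one is
    returned as soon as it is matched (a simplification of this sketch)."""
    prefix = ()
    while prefix not in labels:
        # tokens that keep at least one label reachable from the current prefix
        allowed = {s[len(prefix)] for s in labels
                   if s[:len(prefix)] == prefix and len(s) > len(prefix)}
        tok = max(allowed, key=lambda t: score_next(prefix, t))
        prefix = prefix + (tok,)
    return prefix
```

Because each step only considers tokens drawn from labels consistent with the prefix, the decode always terminates on a valid label, and only len(label) forward passes are needed rather than one pass per candidate label.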
Information extraction The LMs' predictions for the information extraction tasks were generated with greedy decoding at temperature 0, similar to Zhao et al. (2021). We used Exact Match (EM) or F1 as the metric of choice for the extraction tasks.
Computational cost   As discussed at the beginning of this appendix, we used restrictive decoding for the majority of the experiments in the paper. This greatly reduced the computational cost of our experiments: most classification tasks were performed in 1-4 GPU hours for all models (besides experiments with J1-Jumbo, which took roughly 10-50 GPU hours per experiment). The experiments described in Section 3.3 and Section 4 took up to 20 GPU hours each.

B.1 Datasets

The TREC and NLU datasets were used with both fine- and coarse-grained labels. The formats used in all tasks, as well as the values of n_max for both J1 and GPT-2 models, can be found in Table 9. We also used 6 more datasets from the extraction and multiple-choice domains, which were only evaluated with J1 models:
• ATIS airlines (Zhao et al., 2021); n_max = 67.
For Section 4 we used the Natural Questions (Kwiatkowski et al., 2019) and HotpotQA (Yang et al., 2018) datasets. All datasets were evaluated with the standard test set, or the validation set in the absence of a public test set. As described in Section 3, we subsampled all test sets for the ICL experiments. For the Natural Questions dataset, we used half of the test set (its original size was 3610 samples) to speed up evaluation. We used the full HotpotQA validation set, containing 7405 samples. All datasets are in English.
The majority of the datasets can be found in the Huggingface Datasets package (Lhoest et al., 2021), apart from the information extraction tasks ATIS airlines (Hemphill et al., 1990) and MIT movie genre (Liu et al., 2012), which were taken from Zhao et al. (2021), and Natural Questions (Kwiatkowski et al., 2019), which was loaded and combined with retrieved documents using Pyserini (Lin et al., 2021). Since loading the training set via Pyserini is not currently a built-in option, we used the validation set of Natural Questions as an effective training set. We found this decision reasonable since we only used the training set for few-shot prompting, and we did not optimize any parameters on the validation set.
We have tried our best to track the licenses of the datasets used in this work. The license information that we managed to find is as follows: SST-2, RTE, SST-5, NLU Scenario, NLU Intent, BANKING77, and SQuAD: CC-BY 4.0; adversarialQA: CC-BY-SA 4.0; DBPedia: CC-BY-SA 3.0.

B.2 Preprocessing and Formatting
In all ICL experiments, we used only pairs of inputs and expected outputs, without any instructions. For the classification datasets, we mainly used formats from Lu et al. (2021) when applicable. For the extraction and multiple-choice datasets, we used the formats from Brown et al. (2020). We created new formats for classification datasets with dozens of labels, which are rarely used in the few-shot setting. These formats were based on wordings and labels used in HuggingFace, with minor modifications to make them closer to natural language (e.g., replacing '_' with spaces in label names). Details of the classification prompts can be found in Table 9. Experiments from Section 4 were formatted following Lazaridou et al. (2022); their prompt formats are presented in Table 10.

C The Effect of the Number of Context Windows on Performance
When using PCW for ICL, the number of parallel context windows (B) determines the number of in-context training examples. We used B = 2, 3, 4 in preliminary experiments and saw that, for classification tasks, the optimal choice of B depends on the number of unique labels in the task. Performance on tasks with many classes improved as we increased B, while the optimal choice of B for tasks with few classes tended to be 1 or 2 (see Tables 6, 7 and 8). For simplicity, we chose to report results for B = 3 in all of the main experiments. Nevertheless, we were curious to see how far we could push the number of parallel context windows before the model stopped benefiting from them. We used a representative subset of three datasets with a varying number of classes and increased B from 1 to 8. The number of sampled training sets and the test set size for those experiments were set to 15 and 125, respectively.
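The scaling of the example budget with B can be sketched as follows. The helper and its names are ours for illustration; a random, balanced assignment of examples to windows is one plausible scheme, not necessarily the paper's exact one:

```python
import random

def build_windows(train_examples, num_windows, n_max, seed=0):
    """Assign in-context training examples to B parallel windows.

    With per-window capacity n_max, PCW fits up to B * n_max examples
    instead of the single-window limit of n_max. Hypothetical helper;
    the random balanced split is an illustrative choice."""
    capacity = num_windows * n_max
    if len(train_examples) > capacity:
        raise ValueError(f"need at most {capacity} examples for B={num_windows}")
    rng = random.Random(seed)
    examples = train_examples[:]
    rng.shuffle(examples)
    # Deal shuffled examples round-robin so window sizes stay balanced.
    return [examples[i::num_windows] for i in range(num_windows)]
```

With n_max = 5 and B = 3, for example, up to 15 examples fit in context, five per window.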
As seen in Figure 7, as the number of context windows increases, datasets with a large number of classes, such as AGNews and DBPedia (with 4 and 14 labels, respectively), continue to improve, converging at around B = 6. Hence, PCW can achieve even greater improvements by optimizing B per dataset. Increasing the number of context windows, however, seemed to harm performance on SST-2.
Identifying which tasks benefit from large parallel data processing would be an interesting direction for future research. For now, we recommend choosing an optimal B on the development set (if available) for best results. In the absence of a development set, a conservative choice, such as B = 3, may be beneficial. It is possible to investigate the behaviour of PCW with a larger number of windows, but we find it irrelevant for most practical cases of ICL, where an extremely large number of training samples would allow finetuning a model. We leave exploring this issue for future work.

We also evaluated PCW on multiple-choice tasks in the ICL setting. We formatted and evaluated the tasks as in Brown et al. (2020), by providing few-shot examples with the correct completion, followed by an example with context only, and comparing the average per-token LM likelihood of each possible completion. We did not use the calibration from Brown et al. (2020). We used the same setup as described in Section 3.1, except that we reduced the number of sampled training sets and the test set size used for J1-Grande in the OpenBookQA experiment to 15 and 125, respectively. The results in Table 4 show that increasing the number of in-context training examples under the PCW setting improved the performance of J1-Grande on the OpenBookQA task, but did not significantly affect the other scenarios. Based on this observation, PCW has the potential to provide gains for multiple-choice tasks in specific scenarios, but further analysis on more datasets is needed to better understand this. We leave this for future work.
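The multiple-choice scoring rule described above can be sketched as follows. The `token_logprobs` callable stands in for an LM call returning one log-probability per completion token (a hypothetical interface, not the paper's code):

```python
def choose_completion(token_logprobs, completions):
    """Pick the candidate completion with the highest average per-token
    LM log-likelihood (the Brown et al. (2020)-style scoring rule).

    token_logprobs(completion) -> list of per-token log-probs for that
    completion, conditioned on the prompt (stand-in for an LM call)."""
    def avg_logprob(completion):
        lps = token_logprobs(completion)
        # Length-normalize so longer completions are not penalized.
        return sum(lps) / len(lps)
    return max(completions, key=avg_logprob)
```

Length normalization is what distinguishes this rule from summing raw log-probabilities, which would systematically favor shorter answer options.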