OpenICL: An Open-Source Framework for In-context Learning

In recent years, in-context learning (ICL) has gained increasing attention and emerged as a new paradigm for large language model (LLM) evaluation. Unlike traditional fine-tuning methods, ICL adapts pre-trained models to unseen tasks without any parameter updates. However, implementing ICL is sophisticated due to the diverse retrieval and inference methods involved, as well as the varying pre-processing requirements of different models, datasets, and tasks. A unified and flexible framework for ICL is urgently needed to ease the implementation of these components. To facilitate ICL research, we introduce OpenICL, an open-source toolkit for ICL and LLM evaluation. OpenICL is research-friendly, with a highly flexible architecture in which users can easily combine different components to suit their needs. It also provides various state-of-the-art retrieval and inference methods to streamline the process of adapting ICL to cutting-edge research. The effectiveness of OpenICL has been validated on a wide range of NLP tasks, including classification, QA, machine translation, and semantic parsing. As a side product, we found OpenICL to be an efficient yet robust tool for LLM evaluation. OpenICL is released at https://github.com/Shark-NLP/OpenICL.


Introduction
The rise of large language models (LLMs) (Brown et al., 2020; Zhang et al., 2022a; Scao et al., 2022) has revealed an impressive emergent in-context learning (ICL) ability (Wei et al., 2022a). Unlike fine-tuning, which requires parameter updates, ICL performs inference with the model parameters frozen. ICL sidesteps the resource-intensive nature of fine-tuning, yet still yields results comparable to fine-tuned models on specific tasks (Lu et al., 2022; Gao et al., 2021a). However, the field lacks a unified framework for ICL: implementations in existing projects are often highly customized to their own needs, making further development and comparison with previous approaches challenging.
The basic ICL pipeline contains two steps: retrieval and inference. Given a test input X′, several examples are retrieved from the training set in the retrieval stage to serve as in-context demonstrations. In the inference stage, these demonstrations are prepended to X′ and fed into the LLM to generate the prediction. Researchers have explored various methods for both retrieval (e.g., BM25 (Robertson and Zaragoza, 2009), TopK (Liu et al., 2022; Gao et al., 2021a), and VoteK (Su et al., 2022)) and inference (e.g., perplexity-based (Brown et al., 2020), channel-based, and chain-of-thought (Wei et al., 2022b)). However, these methods are often implemented under different frameworks and/or evaluated with different LLMs and tasks. These inconsistencies make systematic evaluation and comparison of the various methods challenging, hindering the development of better ICL methods.
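As a concrete illustration, the two-step pipeline can be sketched in a few lines of self-contained Python. The word-overlap scoring and prompt format below are toy stand-ins for real retrieval methods such as BM25 or TopK, not OpenICL's actual implementation.

```python
def retrieve(train_set, test_input, k=2):
    """Rank training examples by word overlap with the test input
    (a toy stand-in for BM25/TopK) and return the top k as demonstrations."""
    test_words = set(test_input.lower().split())
    scored = sorted(
        train_set,
        key=lambda ex: len(test_words & set(ex["x"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(demos, test_input):
    """Inference stage: prepend the demonstrations before the test input."""
    lines = [f"Input: {d['x']}\nOutput: {d['y']}" for d in demos]
    lines.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(lines)

train = [
    {"x": "the movie was great", "y": "positive"},
    {"x": "terrible acting and plot", "y": "negative"},
    {"x": "a great and moving film", "y": "positive"},
]
demos = retrieve(train, "the film was great")
prompt = build_prompt(demos, "the film was great")
```

The resulting prompt ends with an open "Output:" slot, which the LLM completes to produce the prediction.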
To address this issue, we present OpenICL, an open-source and easy-to-use toolkit for ICL. OpenICL has many state-of-the-art retrieval and inference methods built in to facilitate systematic comparison and fast research prototyping. OpenICL also provides a unified and flexible interface for the development and evaluation of new ICL methods. Users can easily incorporate different retrieval and inference methods, as well as different prompt instructions, into their pipelines. To validate OpenICL's implementation and design, we use it to evaluate LLMs on several NLP tasks, including classification, question answering, translation, and semantic parsing. Our contributions are summarized as follows:
• We propose OpenICL, an easy-to-use and extensible ICL framework for zero-/few-shot evaluation of language models.
• OpenICL provides a wide range of ICL methods, LLMs, and tasks, requiring as little as a few lines of code to use and paving the way for further extensions in the future.
• We provide complete tutorials to walk users through the framework, thus facilitating research and development of ICL.

Related Work
In-context Learning Besides the classic "pre-train and fine-tune" paradigm, Brown et al. (2020) proposed in-context learning (ICL), a new paradigm that leverages pre-trained language models to perform new tasks without any gradient-based training. ICL appends a small number of training examples as prompts before the test input, and has been shown to improve LLMs' performance in few-shot scenarios and to generalize to a wide range of downstream tasks, such as information retrieval (Tay et al., 2022), fact checking (Rae et al., 2021), commonsense reasoning (Geva et al., 2021), arithmetic reasoning (Cobbe et al., 2021), machine translation (Agrawal et al., 2022; Lin et al., 2021a), and data generation. Beyond these early successes, researchers have developed more sophisticated ICL methods that involve intermediate reasoning steps. Among them, chain-of-thought (CoT) prompting was the first to significantly surpass previous state-of-the-art methods on many reasoning tasks (Wei et al., 2022b). Since then, different variants of CoT have been proposed to strengthen its performance, such as Self-Ask (Press et al., 2022), iCAP, Least-to-Most prompting (Zhou et al., 2022), and Selection-Inference (Zhang et al., 2022b; Fu et al., 2022).
Despite its surprising performance, ICL has been criticized for being very sensitive to the choice and ordering of in-context examples (Lu et al., 2022). To address this problem, different selection criteria and context construction methods have been proposed. Gao et al. (2021a) and Liu et al. (2022) select examples that are close to the test input in the embedding space; a line of work (Su et al., 2022; Levy et al., 2022; Ye et al., 2023) selects the most representative examples in the training set to encourage diversity of in-context examples; others observe that the Minimum Description Length (MDL) principle can be an effective criterion for in-context example selection.
Prompt Learning Prompt learning is a special case of ICL without any in-context examples. It comprises various topics, including manual template engineering (Petroni et al., 2019; Brown et al., 2020), automated template learning (Wallace et al., 2019; Shin et al., 2020; Li and Liang, 2021), and answer engineering (Gao et al., 2021b; Schick and Schütze, 2021). We refer readers to OpenPrompt (Ding et al., 2021), a toolkit specially designed for prompt learning. In comparison, OpenICL focuses more on integrating various exemplar retrieval approaches and inference strategies for in-context learning. Note that OpenICL can also seamlessly support prompt learning by setting the number of in-context examples to zero and specifying manual or pre-searched prompt templates from OpenPrompt for different tasks.

OpenICL
In this section, we first explain OpenICL's design principles. Then, we briefly describe its two major components, the Retriever and the Inferencer.

Design Principles
The design principle of OpenICL is to facilitate in-context learning research and enable efficient and robust large language model evaluation. In detail, we consider the following principles:
[P1: Modularity] Since ICL is a fast-evolving research field, the design of OpenICL should be decoupled such that different components can be easily modified to support the latest methods and/or combined to suit various tasks and application needs.
[P2: Efficiency] Nowadays, large language models can have hundreds of billions of parameters. To support inference at such a massive scale, OpenICL should be optimized to enable efficient parallel inference.
[P3: Generality] ICL has been widely used in all fields in NLP, so OpenICL needs a flexible interface that enables it to work with various LLMs, tasks, retrieval methods, and inference approaches.

Figure 1: Overview of the architecture of OpenICL. OpenICL first obtains proper in-context examples from an index set, for each test input or for the whole test set, via retrieval methods (e.g., TopK or VoteK) specified by the user. These examples, as well as the test input x̂, are then formatted according to the user-defined prompt template and concatenated into a single text sequence. Finally, the Inferencer digests these sequences and feeds them into the LLM to obtain the model prediction Ŷ through the defined inference strategy (e.g., chain-of-thought).
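The templating step in this overview can be sketched as follows. The `</key>` placeholder syntax and the `render` helper are illustrative assumptions for this sketch, not OpenICL's actual PromptTemplate interface.

```python
def render(template, example):
    """Fill each </key> placeholder in the template with the example's value."""
    out = template
    for key, value in example.items():
        out = out.replace("</" + key + ">", str(value))
    return out

# One template is applied to every in-context example and to the test input,
# and the rendered pieces are concatenated into a single prompt sequence.
template = "Review: </text>\nSentiment: </label>"
ice = [{"text": "great movie", "label": "positive"}]
test = {"text": "boring plot", "label": ""}
prompt = "\n\n".join(render(template, ex) for ex in ice + [test])
```

The test input's empty label leaves a trailing "Sentiment: " slot for the model to complete.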

Modularity
To satisfy Principle P1, OpenICL adopts a loosely coupled design between components. These components separate the data pre-processing, retrieval, and inference processes, with flexible interfaces that allow easy customization to fit specific needs. The two major components are detailed below:
Retriever The Retriever is responsible for retrieving in-context examples from the pre-existing training data. This module supports both corpus-level retrieval (i.e., retrieving one group of examples for the whole test set) and instance-level retrieval (i.e., retrieving examples for each test input individually). OpenICL primarily supports the following learning-free retrieval methods:
• Random: Early practice in ICL (Brown et al., 2020) often selects examples at random to construct the context. Although Random brings high variance to ICL performance, it remains a popular choice when only a few demonstrations are available (Wei et al., 2022b).
• Heuristic method: To overcome the disadvantages of Random, various semantic-similarity-based retrieval methods have been proposed and have shown great promise, such as BM25 (Robertson and Zaragoza, 2009), TopK (Liu et al., 2022; Gao et al., 2021a), and VoteK (Su et al., 2022).
• Model-based method: More recently, researchers have explored using models' confidence in the output to select and order examples, such as entropy (Lu et al., 2022) and MDL .
OpenICL implements the existing methods above to facilitate future research and systematic comparison. Furthermore, the flexibility of the Retriever module allows practitioners to select the retrieval method and make further modifications that best suit their task and data. The Retriever's interface also allows users to pack the retrieved in-context examples and use them elsewhere.
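The corpus-level versus instance-level distinction above can be sketched with the simplest retriever, Random; the function names here are illustrative, not OpenICL's interface.

```python
import random

def corpus_level_retrieve(train_set, k, seed=0):
    """Corpus-level: one shared group of demonstrations for the whole test set."""
    return random.Random(seed).sample(train_set, k)

def instance_level_retrieve(train_set, test_set, k, seed=0):
    """Instance-level: a separate group of demonstrations per test input."""
    rng = random.Random(seed)
    return {i: rng.sample(train_set, k) for i in range(len(test_set))}

train = list(range(100))
shared = corpus_level_retrieve(train, k=4)
per_instance = instance_level_retrieve(train, ["a", "b", "c"], k=4)
```

Corpus-level retrieval builds one prompt prefix reused everywhere; instance-level retrieval pays a per-input retrieval cost in exchange for demonstrations tailored to each test input.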
Inferencer The Inferencer invokes the pre-trained language model to generate predictions based on the concatenation of in-context examples and the test input. The Inferencer supports various inference methods:
• Direct: Brown et al. (2020) use tokens in the vocabulary to represent candidate answers and select as the final prediction the one with the highest probability.
• Perplexity: Brown et al. (2020) compute the sentence perplexity of the concatenation of the input and each candidate answer, and select as the final prediction the one with the lowest perplexity.
• Channel: Channel models (Yu et al., 2016; Yee et al., 2019) have been proposed to compute the conditional probability in the reverse direction, i.e., estimating the likelihood of the input query given the label.
The flexibility of the Inferencer also allows users to invoke it recursively to support multi-stage ICL methods, such as chain-of-thought (Wei et al., 2022b) and selection-inference (Creswell et al., 2022). Additionally, the Inferencer can be augmented with a scorer to evaluate its predictions.
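As a minimal sketch of the Perplexity strategy above: score each candidate answer by the average negative log-likelihood of the concatenated sequence and pick the lowest (lower average NLL corresponds to lower perplexity). The `mock_nll` scorer is a stand-in for a real language model.

```python
def ppl_infer(prompt, candidates, nll):
    """Select the candidate whose full sequence has the lowest per-token NLL."""
    def avg_nll(candidate):
        seq = prompt + " " + candidate
        return nll(seq) / len(seq.split())
    return min(candidates, key=avg_nll)

# Mock LM scorer: pretend sequences ending in "positive" are more likely.
mock_nll = lambda seq: 5.0 if seq.endswith("positive") else 9.0
pred = ppl_infer("The film was great. Sentiment:", ["positive", "negative"], mock_nll)
```

The Direct strategy differs only in what is scored: it compares the probabilities of the answer tokens themselves, while Channel scores the input conditioned on each label.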

Efficiency
To satisfy Principle P2, we equip OpenICL with various parallelism techniques to enable efficient inference for large-scale models.
Data Parallel Data parallelism (Li et al., 2020) is a common technique in parallel computing for improving the efficiency of large-scale computation. OpenICL uses it to accelerate both the retrieval and inference steps, dividing the data into smaller batches for processing. Additionally, for models that fit into a single GPU's VRAM, OpenICL shards the data across multiple GPUs and performs parallel inference on each GPU with a complete copy of the model. This significantly increases inference speed when working with large datasets.
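The sharding step described above can be sketched as follows: split the test set into one near-equal contiguous chunk per device, run each chunk independently, and merge the results in order. The actual multi-GPU dispatch is elided; this shows only the split/merge logic.

```python
def shard(data, n_shards):
    """Split data into n_shards near-equal contiguous chunks."""
    base, extra = divmod(len(data), n_shards)
    chunks, start = [], 0
    for i in range(n_shards):
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def data_parallel_infer(data, n_shards, infer_fn):
    results = []
    for chunk in shard(data, n_shards):  # in practice, one process per GPU
        results.extend(infer_fn(chunk))
    return results

out = data_parallel_infer(list(range(10)), 3, lambda xs: [x * 2 for x in xs])
```

Because each GPU holds a complete model copy, no inter-device communication is needed during the forward pass; only the final results are gathered.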
Model Parallel In the era of LLMs, models often have billions or hundreds of billions of parameters, exceeding the capacity of modern GPUs. To handle this problem, we resort to model parallelism: a parallel computing technique that divides a large deep learning model into smaller sub-models, each of which can run on a separate GPU. OpenICL supports model parallelism so that users can easily parallelize their models with minimal modification to the code. Currently, we support Megatron and ZeRO (Rajbhandari et al., 2019).
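Conceptually, model parallelism partitions a model's layers into sub-models, each of which would live on its own GPU. In this toy sketch the "devices" are just Python lists of layer functions executed in sequence; real implementations also move activations between devices.

```python
def partition_layers(layers, n_devices):
    """Assign contiguous blocks of layers to devices, one pipeline stage each."""
    per = -(-len(layers) // n_devices)  # ceiling division
    return [layers[i:i + per] for i in range(0, len(layers), per)]

def forward(stages, x):
    for stage in stages:        # in practice: transfer activations between GPUs
        for layer in stage:
            x = layer(x)
    return x

layers = [lambda x: x + 1] * 4 + [lambda x: x * 2] * 2  # a toy 6-layer "model"
stages = partition_layers(layers, 3)
y = forward(stages, 1)
```

Each stage holds only its own slice of the parameters, which is what lets models larger than any single GPU's memory still run end to end.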

Generality
To satisfy Principle P3, OpenICL is designed to maximize users' productivity by supporting a wide range of models, tasks, and methods:
[Model] OpenICL supports both decoder-only LMs (e.g., the GPT family (Radford and Narasimhan, 2018; Radford et al., 2019; Black et al., 2021; Wang and Komatsuzaki, 2021; Black et al., 2022)) and encoder-decoder LMs (e.g., T5 (Raffel et al., 2020)). We also provide two alternatives for accessing a model: users can directly load model checkpoints for evaluation, or access a model via an API (e.g., OpenAI's GPT-3 series models; Brown et al. 2020; Chen et al. 2021; Ouyang et al.). 1
[Tasks] With the help of OpenICL, users can easily conduct experiments on both classification and generation tasks. OpenICL integrates HuggingFace's datasets 2 so that users can access and download thousands of NLP tasks with ease.
[Methods] As mentioned above, OpenICL provides broad support for ICL methods covering both retrieval and inference. Furthermore, OpenICL offers the flexibility to return the results of the Retriever and Inferencer in a step-by-step manner, making it easy to integrate these intermediate results into other projects.

Toolkit Walkthrough
In this section, we demonstrate OpenICL by walking readers through several typical ICL use cases.
Example 2. Figure 3 shows how to use OpenICL for generation problems. We consider the popular machine translation dataset WMT16 (Bojar et al., 2016). As in Example 1, we can easily load the dataset, define the prompt template, and initialize the retriever by feeding the corresponding new parameters to each function. The major API differences from Example 1 are that (i) we add some pre-processing for the translation task (line 5); (ii) PPLInferencer is replaced by an inferencer tailored for generation (line 16); and (iii) we use BLEU to evaluate model performance.
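For intuition on the BLEU evaluation mentioned here, the core idea is modified n-gram precision; this toy computes clipped unigram precision only (real BLEU combines precisions up to 4-grams with a brevity penalty).

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Fraction of hypothesis tokens that appear in the reference,
    with counts clipped to the reference's counts."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum(min(cnt, ref[w]) for w, cnt in hyp.items())
    return overlap / max(1, sum(hyp.values()))

p = unigram_precision("the cat sat on the mat", "the cat is on the mat")
# 5 of the 6 hypothesis tokens are matched (clipped), so p == 5/6
```

Swapping accuracy for a metric like this is the only evaluator change needed when moving from classification to generation tasks.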
Example 3. OpenICL also supports more advanced ICL methods, as shown in Figure 4. Users can seamlessly switch to CoT by modifying only two lines of code: line 14 defines the template for CoT, and line 15 initializes the inferencer with GPT-3 via OpenAI's API. Similar multi-step ICL methods, such as Self-Consistency and Selection-Inference (Creswell et al., 2022), can also be easily implemented by inheriting from the Inferencer superclass designed in OpenICL.
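The inheritance pattern described here can be sketched as follows; the class and method names are illustrative assumptions rather than OpenICL's exact interface. The CoT variant first elicits intermediate reasoning, then invokes the base inference on the augmented prompt.

```python
class Inferencer:
    """Base class: a single LM call on the given prompt."""
    def __init__(self, lm):
        self.lm = lm  # callable: prompt -> model output string

    def inference(self, prompt):
        return self.lm(prompt)

class CoTInferencer(Inferencer):
    """Two-stage chain-of-thought: elicit reasoning, then answer."""
    def inference(self, prompt):
        reasoning = self.lm(prompt + "\nLet's think step by step.")
        # Recursively invoke the base inference on the augmented prompt.
        return super().inference(prompt + "\n" + reasoning + "\nAnswer:")

# Mock LM: returns reasoning for the first stage, the answer for the second.
mock_lm = lambda p: "9" if p.endswith("Answer:") else "3 * 3 = 9"
pred = CoTInferencer(mock_lm).inference("Q: What is 3 squared?")
```

Multi-step methods like Self-Consistency would follow the same shape, e.g., sampling several reasoning paths in the subclass and majority-voting over the answers.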

Conclusion
We present OpenICL, an open-source toolkit for in-context learning. OpenICL provides a convenient and flexible interface for in-context learning practice and research. Its modular design allows it to support a wide range of LLMs, tasks, and ICL methods with ease. We implement both model parallelism and data parallelism to make inference with large models more efficient. OpenICL is highly extensible, and we will continue to update it to keep pace with the latest research. Despite these promising results, ICL is still in its early stages, and many challenges remain. We believe OpenICL will be a valuable resource for researchers and practitioners to facilitate their research and development.