What In-Context Learning “Learns” In-Context: Disentangling Task Recognition and Task Learning



Introduction
Large language models (LLMs) have demonstrated the ability to perform in-context learning (ICL), i.e., "learning" to perform a task purely from examples in the context without any parameter updates (Brown et al., 2020). This powerful and flexible phenomenon enables LLMs to be used as general-purpose models that can perform any task with a small set of labeled examples.
However, there is still no consensus on how in-context learning works. Some previous work hypothesizes that during pre-training, LLMs implicitly learn the tasks required for downstream applications, and the in-context demonstrations merely provide information that allows the model to recognize which task is required (Xie et al., 2022). Min et al. (2022) show empirical evidence for this hypothesis by demonstrating that ICL performance is insensitive to the usage of ground-truth labels.
On the other hand, Akyürek et al. (2023) and von Oswald et al. (2022) construct theories that Transformer-based models may perform implicit gradient descent to update an "inner model", and Dai et al. (2023) demonstrate similarities between in-context learning and explicit fine-tuning through a series of metrics on real-world datasets. Such hypotheses assume the correct input-output mappings are important and that ICL actually performs implicit learning over demonstrations.
In this paper, we disentangle ICL into task recognition (TR), which recognizes the task from demonstrations and applies LLMs' pre-trained priors, and task learning (TL), which learns a new input-label mapping from demonstrations. In common ICL scenarios where ground-truth labels are provided, TR and TL take effect simultaneously.
We propose two settings to tease them apart: 1) RANDOM, where the labels are uniformly sampled from the label space (Min et al., 2022), restricting LLMs to only apply TR; 2) ABSTRACT, where the labels are replaced with abstract symbols (e.g., numbers or letters) that never co-occurred with the inputs in pre-training. We focus on how the two abilities in ICL evolve with two factors, model sizes and numbers of demonstrations, which have been neglected in the related literature.
Through extensive experiments with a series of classification datasets on GPT-3 (Brown et al., 2020), LLaMA (Touvron et al., 2023), and OPT (Zhang et al., 2022), we find:
• The gap between GOLD and RANDOM is small with smaller models, corroborating Min et al. (2022). However, with larger models and more examples, the gap becomes larger. This suggests TR plays a significant role in ICL, but it does not scale with increasing parameters or examples.
• LLMs also perform TL, which emerges with larger models and more demonstrations. With the largest models and more than 16 examples, ABSTRACT outperforms RANDOM, pointing to a paradigm shift in in-context learning at scale. Together, our findings provide a better way to understand ICL behaviors.

Task Recognition and Task Learning
An LLM (parameterized by $\theta$) performs ICL by conditioning on the input-label pair demonstrations $\mathcal{D}_{\text{demo}} = (x_1, y_1, x_2, y_2, \ldots, x_K, y_K)$ and the test input $x_{\text{test}}$ to predict the label $y_{\text{test}} \sim p_\theta(y \mid \mathcal{D}_{\text{demo}}, x_{\text{test}})$, where the demonstrations elicit a mapping $f: \mathcal{X} \to \mathcal{Y}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$. We delineate two ways an LLM can leverage in-context demonstrations: task recognition and task learning.
Task recognition (TR) represents the model's ability to recognize the mapping $f$ purely by observing the input distribution $\{x_i\}_{i=1}^{K}$ and the label distribution $\{y_i\}_{i=1}^{K}$, without the provided $(x_i, y_i)$ pairings.
The LLM then applies its pre-trained priors to the recognized $f$. Formally, when only TR is enabled, the prediction is $y_{\text{test}} \sim p_\theta(y \mid \{x_i\}_{i=1}^{K}, \{y_i\}_{i=1}^{K}, x_{\text{test}})$, which suggests TR does not rely on the pair information. For example, an input distribution of movie reviews and a label distribution of "The sentiment is positive/negative" can easily be recognized as a sentiment classification task due to their prevalence during pre-training, and LLMs can make reasonable predictions without explicitly "learning" the task via ground-truth demonstrations. This leads to observations that the model can still perform well even when we provide wrong input-label mappings, e.g., "The movie is great. The sentiment is negative" (Min et al., 2022).

Task learning (TL), on the other hand, characterizes how the model learns a new mapping from the input-label pairs through demonstrations. Unlike TR, TL allows models to learn novel mappings, and thus correct input-label pairs are crucial.
We posit that the two mechanisms occur under separate conditions, as recognizing an already learned task is easier than learning a new mapping. Models are able to perform TR at a small scale, but this ability does not drastically improve with increasing model sizes and demonstrations; on the other hand, TL improves significantly when model sizes and numbers of demonstrations increase. To show the above phenomenon, we disentangle TR and TL through label space manipulation, using three different setups (examples in Figure 1):
• GOLD: the standard ICL setting, where we use natural prompts and gold input-label pairs. This setup reflects both TR and TL abilities.
• RANDOM: similar to Min et al. (2022), we use the same natural prompts as GOLD and sample demonstration labels uniformly at random from the label space. This setup reflects TR only.
• ABSTRACT: we use minimal prompts (which provide no task information) and characters with no clear semantic meaning (e.g., numbers, letters, and random symbols) as the label for each class. We found that even abstract labels may carry biases from pre-training, e.g., "0" is biased towards negative. Hence, for each prompt $x_1, y_1, \ldots, x_K, y_K$, we randomly sample a bijection $\phi: \mathcal{Y} \to \mathcal{Y}^{*}$ to avoid any bias, so that no task-specific information is leaked in either the prompt template or the label space. To evaluate the model's ABSTRACT performance, we measure its accuracy using $\phi(y_{\text{test}})$ as the target labels. Since these input-label mappings are never seen in pre-training, this setup reflects the TL ability.
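The three label-manipulation setups can be sketched in a few lines of Python. This is a minimal illustration of the logic described above; the function and variable names are our own, not taken from any released code:

```python
import random

def build_demos(examples, label_space, setting, abstract_labels=None, seed=0):
    """Build demonstration pairs for one prompt under GOLD, RANDOM, or ABSTRACT.

    examples: list of (input, gold_label) pairs sampled from the training set.
    label_space: the natural label set Y (e.g. ["positive", "negative"]).
    abstract_labels: the abstract label set Y* (e.g. ["0", "1"]) for ABSTRACT.
    Returns (demos, phi), where phi is the sampled label mapping used to score
    the test example in the ABSTRACT setting (None otherwise).
    """
    rng = random.Random(seed)
    if setting == "gold":
        # standard ICL: keep the ground-truth labels
        return list(examples), None
    if setting == "random":
        # labels drawn uniformly from the label space, independently of the input
        return [(x, rng.choice(label_space)) for x, _ in examples], None
    if setting == "abstract":
        # sample a fresh bijection phi: Y -> Y* for each prompt, so no abstract
        # label is systematically tied to one class across prompts
        shuffled = list(abstract_labels)
        rng.shuffle(shuffled)
        phi = dict(zip(label_space, shuffled))
        return [(x, phi[y]) for x, y in examples], phi
    raise ValueError(f"unknown setting: {setting}")
```

Under ABSTRACT, a test prediction then counts as correct when it equals `phi[y_test]` for that prompt's sampled mapping.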
In the following sections, we conduct comprehensive experiments with the above three settings along two axes (model sizes and numbers of demonstrations) and show how TR and TL manifest under different conditions.
Experimental Setup

Datasets
We experiment on 16 classification datasets across 4 types of tasks: sentiment analysis, toxicity detection, natural language inference/paraphrase detection, and topic/stance classification. All datasets and references are in Appendix A. Our dataset selection largely follows Min et al. (2022), but we exclude multi-choice datasets since it is difficult to apply our ABSTRACT setup to them.

Task Setup
We adopt a sample-based evaluation protocol: for each test example, we sample a different set of demonstrations from the training set. We manually design 3 prompt templates for each type of classification task, in a similar style to the prompts from Min et al. (2022). We report the mean by averaging across datasets and prompts, and the standard deviation across different prompts for each data point. For GPT-3, we sample 150 examples per dataset; we use fewer examples due to budget constraints, and GPT-3 exhibits lower variance than the other model families. For OPT and LLaMA, we sample 1,350 examples for all datasets.
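One plausible reading of this aggregation (our interpretation, not the paper's exact procedure) is to average per-dataset accuracies within each prompt template, then report the mean and across-prompt standard deviation of those per-prompt averages:

```python
from statistics import mean, stdev

def aggregate(acc_by_prompt):
    """Aggregate accuracies for one model/setting/K configuration.

    acc_by_prompt maps a prompt-template id to the list of per-dataset
    accuracies obtained with that template.
    """
    per_prompt = [mean(accs) for accs in acc_by_prompt.values()]
    # mean over datasets and prompts; deviation across prompt templates
    return mean(per_prompt), stdev(per_prompt)
```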
We design two kinds of prompts: natural language prompts (Table 1), which are similar to the manual prompts in Min et al. (2022), and minimal prompts (Table 3), which remove any natural language instructions for the task. For ABSTRACT, we tested three types of label choices: numbers (0, ..., N − 1, where N is the number of classes), letters (the first N letters of A, B, C, ...), and symbols (the first N symbols of "@", "#", "$", "%", "*", and "∧"). For each test example, we randomly sample a new mapping between labels and abstract characters. We report results with number labels in all the main results and compare the three forms in §4.2.
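The three abstract label families can be generated mechanically. A small sketch (the function name is ours; the symbol inventory follows the order given above):

```python
import string

# symbol inventory in the order listed in the text
SYMBOLS = ["@", "#", "$", "%", "*", "∧"]

def abstract_label_set(kind, n_classes):
    """Return the first N abstract labels of the requested kind."""
    if kind == "number":
        return [str(i) for i in range(n_classes)]        # "0", ..., "N-1"
    if kind == "letter":
        return list(string.ascii_uppercase[:n_classes])  # "A", "B", "C", ...
    if kind == "symbol":
        return SYMBOLS[:n_classes]
    raise ValueError(f"unknown kind: {kind}")
```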

Results
Figure 2 shows our main results for GPT-3, LLaMA, and OPT under our 3 settings: GOLD, RANDOM, and ABSTRACT. Below we summarize the trends of TR and TL across different conditions.

Main Results
Summary of overall trends. We first verify that GOLD consistently performs the best across model families and numbers of demonstrations, which is expected given that the GOLD setting provides the model with all available information. Overall, the RANDOM curves do not increase with either model sizes or numbers of demonstrations, remaining largely flat; in the scenario with small model sizes and few examples (K = 8), there is an insignificant gap between RANDOM and GOLD. Meanwhile, the ABSTRACT curves demonstrate an increasingly steep slope as the model sizes and the number of demonstrations grow; with small models or small K, ABSTRACT mostly underperforms RANDOM, whereas ABSTRACT with the largest models and K = 32 performs well above RANDOM (and may even be competitive with GOLD). We note that the OPT curves exhibit significant variance, which we hypothesize to be a result of the models potentially being under-trained. We elaborate on the takeaways for TR and TL below.
Task recognition is a broad capability across scales. For all model families, the RANDOM setting shows similar performance at all sizes and numbers of demonstrations. Moreover, TR performance is significantly stronger than the random baseline, even with small models and few examples. For instance, even the smallest 350M-parameter models are able to recognize the task using just 8 examples, achieving around a 10-point average performance lead over the random baseline for GPT-3 ada and 5 points for OPT-350M. This shows that task recognition from in-context examples does not drastically scale with model sizes or numbers of examples.
Task learning is enabled by scale. We observe that TL is dependent on model size: smaller models perform roughly the same across all numbers of demonstrations (see Figure 6). On the other hand, larger models can utilize the provided mapping information and perform TL, as ABSTRACT (TL) performance increases drastically with larger sizes (first row of Figure 2). When using a larger model, the results also improve as the number of demonstrations increases (second row of Figure 2). With only 16 examples, OPT-66B and davinci are able to match the performance of GOLD while using a new label mapping. While LLaMA-65B's ABSTRACT performance is not as competitive as its GOLD performance, the trend of improving ABSTRACT performance with larger sizes or larger K is clear. This suggests that TL is only enabled by scale and further improves with more demonstrations.

Further Analysis
The trends for task learning generalize across different types of abstract labels. In Figure 3, we show ABSTRACT results with number, letter, and symbol labels, respectively. We observe that all three versions show a similar trend and coincide with our main results. Numbers and letters perform consistently better than symbols. This may be because letters and numbers appear more frequently in the pre-training corpus, and therefore make for a more "natural" label space.
Task difficulty affects the trends. We notice that ABSTRACT scales better with sizes and examples when the task is simpler. In Figure 4 we compare two types of tasks: sentiment analysis and natural language inference (NLI). Since NLI is more difficult, we observe that it produces a flatter ABSTRACT curve, suggesting that the model relies more on the natural prompts and pre-training priors to solve those tasks. We present the full task-type breakdown results in §C.

Related Work

Several works have explored theoretical frameworks in which ICL can be seen as implicit gradient descent, treating a forward pass over the in-context demonstrations as an "update" to an implicit internal model (Akyürek et al., 2023; von Oswald et al., 2022; Dai et al., 2023). For mechanistic perspectives on ICL, Olsson et al. (2022) and Bansal et al. (2022) identify induction heads (subnetworks that perform in-context pattern recognition) in small and large models, respectively.
While our conclusions align with aspects of previous studies, our work contributes novel insights on multiple axes. Min et al. (2022) also show that even small models can perform TR and argue that the performance gap between GOLD and RANDOM is consistently small, but most of their experiments are on ≤13B models with 16 demonstrations; we show that as model sizes scale, GOLD tends to improve while RANDOM does not. Thus, the performance deficit of RANDOM grows as models become larger. Yoo et al. (2022) also perform experiments similar to RANDOM and ABSTRACT, but they do not deeply investigate the effect of model sizes or numbers of demonstrations. Concurrent work by Wei et al. (2023) obtains similar results; additionally, they show that instruction-tuning strengthens the model's semantic priors more than it improves TL. However, they primarily focus on closed-source models, whereas we also conduct experiments on public models such as LLaMA and OPT. Collectively, our findings offer a comprehensive understanding of how ICL works across scales.

Conclusion
While previous work often studies ICL as an umbrella term, regardless of model sizes and numbers of examples, we argue that there are two distinct characterizations of ICL, task recognition and task learning, and demonstrate that they emerge under different conditions. Even small models are capable of performing TR, but this ability does not scale. On the other hand, TL is an emergent ability of large models; small models are unable to perform TL even when provided with more demonstrations, whereas large models can leverage more demonstrations to improve their TL performance. We suggest that future work on ICL should distinguish the two phenomena and clearly state the conditions under which experiments are conducted.

Limitations
Though LLMs with in-context learning are capable of all kinds of NLP tasks, this work is limited to classification tasks because they are easier to adapt to our RANDOM and ABSTRACT setups. We leave other types of NLP tasks to future work.
Another limitation of our work lies in the definition and discussion of task learning. Though we empirically show that large models are capable of acquiring a novel mapping to abstract labels like numbers or letters, how models "learn" mechanistically remains elusive. As suggested in previous work, LLMs may conduct implicit gradient descent over demonstrations, or they may alternatively map the patterns shown in the demonstrations back to concepts learned in pre-training. To some extent, these mechanisms could be considered an advanced form of "task recognition". This work only designs experiments to better observe and disentangle TR and TL, and we look forward to further studies that reveal more insights about the mechanistic inner workings of these phenomena in ICL.

B Prompt Templates
For each task category (e.g., sentiment classification, topic detection), we manually design three natural language templates. Depending on the exact specifications of the dataset, templates may be adjusted to better reflect the task (e.g., "Is this atheist?" for tweet_eval_atheist). We apply these templates to the natural language label sets (GOLD and RANDOM). All prompts are presented in Table 1.
We also design minimal templates for ABSTRACT in two task-agnostic variations: one for single-sentence tasks and one for multi-sentence tasks (e.g., NLI tasks). We use these minimal templates with the abstract label sets in order to prevent the model from being exposed to any information about the task through the prompt design. All minimal templates are presented in Table 3.

All prompts are designed to be answered with single-token responses (e.g., "Yes/No", "True/False", "positive/negative/neutral", "0/1/2", "A/B/C") so that we can directly check the model's last-token prediction instead of applying decoding methods.
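Because every verbalizer is a single token, prediction reduces to comparing the model's scores for the candidate label tokens at the final position. A minimal sketch of this check (the function name and interface are ours; `last_token_logits` stands for the model's vocabulary scores at the last position of the prompt):

```python
def predict_label(last_token_logits, label_token_ids):
    """Pick the label whose single verbalizer token scores highest.

    last_token_logits: sequence of scores over the vocabulary at the
        final position of the prompt.
    label_token_ids: mapping from label string to its vocabulary id.
    """
    return max(label_token_ids,
               key=lambda label: last_token_logits[label_token_ids[label]])
```

This avoids any decoding: the predicted label is simply the verbalizer whose token receives the highest score.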

C More Results
We show average model performance with respect to the number of parameters in Figure 5. It is clear that small models struggle to perform ABSTRACT, regardless of how many examples are provided, whereas the largest models (especially GPT-3 Davinci and OPT-66B) are able to perform ABSTRACT. Additionally, their performance improves even more when more demonstrations are provided.
We show average model performance with respect to the number of demonstrations in Figure 6. We can see a clear trend: RANDOM (TR) does not change much, but ABSTRACT improves drastically with more examples, especially for GPT-3 Davinci and OPT-66B.
Figure 7 shows all the ABSTRACT results and demonstrates a similar trend to what §4.2 describes.
Figure 8, Figure 9, Figure 10, and Figure 11 show task-type breakdown results. Though individual task-type results are noisier, we can make an observation similar to the main result: ABSTRACT (TL) scales better with sizes and numbers of examples than RANDOM (TR).

Figure 1: We perform experiments in three settings: RANDOM (top), ABSTRACT (middle), and GOLD (bottom). Our experiments demonstrate that task recognition (TR; shown by RANDOM) does not scale with model sizes and numbers of demonstrations, while task learning (TL; shown by ABSTRACT) does.

Figure 3: Performance of three types of ABSTRACT labels (numbers, letters, and symbols) on davinci and OPT-66B.
Many works have attempted to deepen empirical or theoretical understanding of ICL since its emergence in Brown et al. (2020). For instance, Xie et al. (2022) present a theoretical framework where latent "concepts" parameterize each document in pre-training. They posit that all concepts have been learned in pre-training; thus, ICL is the result of implicit Bayesian inference, where the LM uses in-context demonstrations as evidence to identify the correct concept. Min et al. (2022) present empirical evidence for this framework by showing that only limited information, rather than true input-label mappings, is needed to perform ICL. Other works investigate the impact of the pre-training corpus on ICL. Chan et al. (2022) identify properties of the pre-training distribution that enable ICL behavior, including burstiness, label multiplicity, and a long-tailed class distribution, all of which are satisfied by natural language. Razeghi et al. (2022) show that the frequency of terms in the pre-training corpora is positively correlated with model performance. Kirsch et al. (2022) show that both a rich training distribution and a sufficiently large model are critical to the development of in-context learning abilities.

Figure 6: Averaged accuracy across 16 datasets for GPT-3 (top), LLaMA (middle), and OPT (bottom). The x-axis shows the number of demonstrations in the prompt. For each model, we run experiments with RANDOM (left), ABSTRACT (middle), and GOLD (right) demonstrations. Variance is calculated across three templates.

Table 1 :
Natural prompts used as input in GOLD and RANDOM settings for single-sentence datasets.<s> denotes the input sequence; labels are illustrated in red.

Table 2 :
Natural prompts used as input in GOLD and RANDOM settings for multi-sentence datasets.<s1> and <s2> denote the input sequences; labels are illustrated in red.

Table 3 :
Minimal prompts used for ABSTRACT.

Table 4 :
Single-dataset accuracies across the GPT-3 model family, using 8 examples.