In-Context Learning Creates Task Vectors

In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm. However, its underlying mechanism is still not well understood. In particular, it is challenging to map it to the "standard" machine learning framework, where one uses a training set $S$ to find a best-fitting function $f(x)$ in some hypothesis class. Here we make progress on this problem by showing that the functions learned by ICL often have a very simple structure: they correspond to the transformer LLM whose only inputs are the query $x$ and a single "task vector" calculated from the training set. Thus, ICL can be seen as compressing $S$ into a single task vector $\boldsymbol{\theta}(S)$ and then using this task vector to modulate the transformer to produce the output. We support the above claim via comprehensive experiments across a range of models and tasks.


Introduction
Large language models have improved dramatically over the last several years. One striking property of these models is that they can learn new rules from very few demonstrations. For instance, a model can be prompted with the input "Apple → Red, Lime → Green, Corn →" and produce the output "Yellow". The model has thus learned a mapping based on just two examples, which it can apply correctly to new examples. This capability, referred to as In-Context Learning (ICL), has been used extensively, yielding impressive empirical results (Brown et al., 2020; Liu et al., 2023; Dong et al., 2022).
Given this success, it is natural to ask what the underlying mechanism behind ICL is. Namely, how does the model internally use the demonstrations S and the query x to produce the required output? (We release our code at https://github.com/roeehendel/icl_task_vectors.) Here we approach this question by utilizing the concept of a hypothesis class from statistical learning theory (Shalev-Shwartz and Ben-David, 2014). In the learning-theoretic formulation, one typically considers a hypothesis class H, where every element of H is a function h(x; θ), operating on the input x and specified by a parameter vector θ. For example, if x ∈ R^d, then the class H could be the set of linear classifiers defined by a coefficient vector θ as h(x; θ) = θ · x. Learning algorithms seek an element h ∈ H that fits the training set well; this is known as Empirical Risk Minimization.
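As a toy illustration of this framing (not part of the paper's method), the following sketch implements Empirical Risk Minimization over a small hypothesis class of linear classifiers; the finite candidate grid and all names are our own simplifications.

```python
# Toy hypothesis class: linear classifiers h(x; theta) = sign(theta . x),
# and ERM over a finite set of candidate parameter vectors.
# This is an illustrative sketch, not the paper's mechanism.

def h(x, theta):
    # Linear classifier specified by the parameter vector theta.
    dot = sum(a * b for a, b in zip(theta, x))
    return 1 if dot >= 0 else -1

def erm(S, candidates):
    # Pick the theta whose classifier makes the fewest mistakes on
    # the training set S (a list of (x, y) pairs).
    def empirical_error(theta):
        return sum(h(x, theta) != y for x, y in S)
    return min(candidates, key=empirical_error)
```

On a linearly separable training set, `erm` recovers a separating parameter vector from the candidates; ICL, by contrast, is not obviously of this form, which is exactly the question the paper addresses.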
It is unclear whether ICL operates in such a way, because the prediction is performed via T([S, x]), where T is typically an auto-regressive transformer and [S, x] is a concatenation of the tokens in S and x. Thus, in the general case, it can be an arbitrary function that operates on S and x to produce the output. This can include "non-parametric" methods such as nearest-neighbor. Recent work has begun to explore this question. For example, it was shown that when training a transformer from scratch to perform linear regression in context, the emerging learning algorithm is similar to Stochastic Gradient Descent (Akyürek et al., 2022; von Oswald et al., 2022). However, for LLMs performing more complex natural language tasks, it is not at all clear what the hypothesis space may be.
In this work, we show that on a wide range of tasks, ICL in LLMs can be viewed as working on a very natural hypothesis space. We argue that, given a training set S, the transformer maps it into a "task vector" θ(S) that essentially represents the mapping/rule described in S. Namely, given the transformer T and a vector θ, we can construct a new function f(x; θ) that implements the task. The function f is very similar to the original transformer applied to x without demonstrations, but instead modulated by θ (see Fig. 2).
Our view is also related to soft prompts (Lester et al., 2021), since both approaches modulate the function of the transformer towards a particular task.However, in ICL, task vectors are calculated in the forward pass rather than being fine-tuned.
Our contributions include proposing a hypothesis-class-based mechanistic view of ICL, and conducting experiments to validate this view on a range of publicly available LLMs and a diverse set of tasks. Our results further the understanding of ICL and may have practical implications for the efficient adaptation of LLMs to specific tasks.

A Hypothesis Class View of ICL
Motivated by the hypothesis class view of learning theory, our goal is to understand whether ICL maps the set of demonstrations S to a function on the query x, and how this mapping occurs. Specifically, we seek to see if ICL converts S into θ, the "parameters" of a function within a certain hypothesis space. Our empirical findings suggest this view is applicable, shedding light on the structure of the hypothesis space on which ICL can be viewed to operate.

Theoretical Framework
We use T to denote a decoder-only transformer LLM, S to denote the set of demonstrations (i.e., training examples) used as input to ICL, and x to denote the query that ICL is asked to provide an output for. We use T([S, x]) to denote the output of ICL on the concatenation of S and x.
To demonstrate that ICL operates within a hypothesis space, we aim to show that its underlying mechanism can be broken down into two parts:
• A "Learning Algorithm" (denoted by A) that maps S into a "task vector" θ, independent of the query x. Given that attention layers can access both S and x, this independence is not trivial.
• A "Rule Application" (denoted by f) which maps the query x to the output, based on θ ≡ A(S), without direct dependence on S. Again, this independence is not trivial.
Thus, we consider the following mapping from a set of demonstrations and a query to the predicted output:

T([S, x]) = f(x; A(S)).

If we can break down the forward pass of the LLM into the above two components, we can view ICL as operating on the hypothesis class {f(·; θ) | θ}. In the next section we propose an implementation of such a class.
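To make the interface of this decomposition concrete, here is a minimal toy instantiation of T([S, x]) = f(x; A(S)) for a hypothetical "add a constant" task family; A and f below are illustrative stand-ins, not the transformer-internal components studied in the paper.

```python
# Toy decomposition T([S, x]) = f(x; A(S)) for a hypothetical
# "add a constant" task family. A and f are illustrative stand-ins.

def A(S):
    # "Learning algorithm": compress the demonstrations into a single
    # parameter theta, with no access to the query.
    return sum(y - x for x, y in S) / len(S)

def f(x, theta):
    # "Rule application": uses only the query x and theta, never S.
    return x + theta

def T(S, x):
    # The full in-context prediction, viewed through the decomposition.
    return f(x, A(S))
```

The point of the two properties above is exactly that A never sees x, and f never sees S; the paper's experiments test whether LLM forward passes admit such a split.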

A Proposed Hypothesis Class
There are many possible realizations of the above framework, corresponding to different choices of A and f. We next describe the realization we focus on, which naturally follows from the transformer architecture. We consider an ICL setting as in Fig. 1, where the input ends with a query x (i.e., Corn) followed by an "→" symbol. As mentioned above, we view learning as composed of two steps: calculating a parameter vector θ based on the training sample S, and applying the rule defined by this parameter vector to the query x. A presumably simple way for a transformer to do this is for the first L layers to calculate θ in the representations of the "→" token, and for the remaining layers to take θ and x as input and produce an output (see Fig. 1). Recall that S and x are accessible to the transformer at any layer, which presents a challenge for our view.
In the following sections, we address this challenge and present experiments validating our view. Namely, we show that we can isolate our proposed A and f in the forward pass of LLMs performing ICL. We also show that the θ vectors are interpretable and correspond to learned tasks.

Validity of the Hypothesis Class View
We first show that separating the forward pass into the two distinct components A and f , defined in §2.2, maintains the high accuracy of ICL.

Separating A and f
We face two challenges in a regular forward pass: first, the initial L layers that correspond to A, updating the representations of → to create θ, can attend to the query x. Thus, they may depend on x, creating an unwanted dependence of θ on x. Second, the remaining layers that correspond to f may directly access S, instead of using only x and θ.
We propose the following procedure to tackle these challenges. To solve the first problem, we introduce a "dummy query" x′ and calculate the representations of → using that query. We use the representation of → after the first L layers, calculated using x′, as the vector θ (as demonstrated on the left side of Fig. 2). An alternative was to block attention to x, but it led to poor performance. To solve the second problem of calculating f(x, θ) without allowing direct dependence on S, we perform a forward pass of the transformer only on x and →, and "patch" the θ we previously extracted at the Lth layer of the → (right side of Fig. 2).
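The extract-and-patch procedure can be sketched on a toy "transformer": a stack of layer functions acting on a list of per-position states. Everything below is a placeholder; a real implementation would register hooks on an actual LLM and operate on hidden-state tensors.

```python
# Toy sketch of the extract-and-patch procedure. The "transformer" is a
# list of layer functions over per-position states; all of this is a
# stand-in for hooking a real model's hidden states.

def run_layers(layers, states):
    for layer in layers:
        states = [layer(s) for s in states]
    return states

def extract_theta(layers, L, prompt_states):
    # Forward [S, x', "->"] through the first L layers and keep the
    # representation at the final "->" position as theta.
    hidden = run_layers(layers[:L], prompt_states)
    return hidden[-1]

def patched_forward(layers, L, query_states, theta):
    # Forward [x, "->"] alone; at layer L, overwrite the "->" position
    # with theta, so the remaining layers see theta but never S.
    hidden = run_layers(layers[:L], query_states)
    hidden[-1] = theta
    return run_layers(layers[L:], hidden)
```

The patch at layer L is the only channel through which information about S can reach the output, which is what enforces the separation into A and f.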

Tasks and Models
Tasks We consider a diverse set of 18 tasks across 4 categories: algorithmic, translation, linguistic, and factual knowledge. For simplicity, we limit ourselves to single-token outputs. A representative subset of the tasks is described in Tab. 1. A complete detailed table, as well as more information regarding the data, is provided in §A.1.

Finding L
The mechanism we described in §2.2 has a free parameter: the layer L where A ends and f begins. We use the proposed (A, f) implementation for different choices of L and evaluate the accuracy on a development set to find the best layer.
Fig. 3 shows the accuracy on the development set for different choices of L. We focus here on the LLaMA models and include the rest in §A.2. Interestingly, all models exhibit a performance peak at a similar intermediate layer, irrespective of differences in their parameter counts and numbers of layers.
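The search for L amounts to a one-dimensional sweep over layers. A sketch, where `eval_accuracy` is a hypothetical callback standing in for running the (A, f) procedure on the development set:

```python
def find_best_layer(num_layers, eval_accuracy):
    # eval_accuracy(L) is assumed to run the (A, f) procedure with the
    # split placed at layer L and return development-set accuracy.
    scores = {L: eval_accuracy(L) for L in range(1, num_layers)}
    return max(scores, key=scores.get)
```

Because the development curve peaks at a single intermediate layer (Fig. 3), this argmax over a one-dimensional grid suffices; no joint search over other quantities is needed.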

Accuracy of Hypothesis Based Prediction
We next compare the accuracy of the (A, f) mechanism to that of a regular forward pass performing ICL. For each model and task, we evaluate the following three procedures:
• Regular: An application of the LLM to the demonstrations S and query x, namely T([S, x]), as in regular ICL.
• Hypothesis: Our proposed procedure from §3.1, where A generates θ using a dummy x′, and f(·; θ) is applied to x by running the transformer on [x, →] with θ patched at layer L of →.
• Baseline: A forward pass of the LLM only on x, without the demonstrations S, that is, T([x, →]). This is the same as the application of f from our separated procedure, but without patching θ.
Fig. 4 shows the average accuracy across all tasks of these three procedures, for each model. Full results are reported in Tab. 6 in §A.2. Across all models, our procedure maintains around 80-90% of the accuracy of regular ICL, while the baseline reaches only 10-20%. This shows that our proposed separation into A and f provides a good empirical approximation of the process underlying ICL.

Robustness of Task Vectors
In our setting, θ is derived from S and a dummy query x′. It is natural to examine the robustness of θ to variations in these inputs. Intuitively, if it represents the task, it should remain stable across different values of S and x′. To test this, we use LLaMA 7B to generate 50 task vectors per task with varied S and x′ and conduct two analyses.
Geometry of θ A t-SNE dimensionality reduction (Fig. 5) reveals that the task vectors form distinct clusters, each containing task vectors of a single task. Fig. 9 further shows proximity between tasks of the same category, strengthening the idea that they encapsulate task understanding.
Variability of θ Fig. 8 shows histograms of distances within and across tasks. Vectors within the same task are closer to one another than vectors of different tasks, indicating that θ is stable within tasks and not highly influenced by x′ or S.
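The within-task versus across-task comparison underlying Fig. 8 can be sketched as follows; the Euclidean metric and the data layout are our illustrative choices.

```python
import itertools
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def within_across(vectors_by_task):
    # vectors_by_task: dict mapping a task name to its list of task
    # vectors (e.g., 50 vectors from different choices of S and x').
    # Returns the mean pairwise distance within tasks and across tasks.
    within, across = [], []
    tasks = list(vectors_by_task)
    for t in tasks:
        for u, v in itertools.combinations(vectors_by_task[t], 2):
            within.append(euclidean(u, v))
    for t1, t2 in itertools.combinations(tasks, 2):
        for u in vectors_by_task[t1]:
            for v in vectors_by_task[t2]:
                across.append(euclidean(u, v))
    return sum(within) / len(within), sum(across) / len(across)
```

A stable task representation shows up as the within-task mean being clearly smaller than the across-task mean, matching the separated histograms in Fig. 8.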

Dominance of θ Patching
In §3 we prevented f from directly accessing S. However, in a regular forward pass during ICL, the last token can attend to S. Here we verify that even in this case, f mainly uses the task vector θ, without directly accessing the demonstrations S. To this end, we use a pair of tasks, A and B, sharing the input space but differing on the output. We first use a "Regular" forward pass, where we provide the model with demonstrations S for task A (denoted S_A), to verify that the model can perform this task using ICL. Then, we do a "Conflicting" forward pass, still providing S_A, while injecting θ_B. For more details, refer to Fig. 6 in §A.1. In Tab. 2, the "Regular" forward pass shows high accuracy on task A (90%+), as anticipated. However, the "Conflicting" forward pass yields high accuracy on task B, corresponding to the injected task vector θ. This implies that the model mainly relies on θ, largely disregarding the demonstrations S for task A. We note that the accuracy on task B is slightly low, likely consistent with the performance dip seen in Fig. 6, and potentially further affected by the presence of S.

Interpreting θ
The learned vector θ intuitively captures information about the task demonstrated by S. Here we provide evidence supporting this interpretation. Since θ is an intermediate hidden state of the transformer, we can employ a vocabulary projection method (nostalgebraist, 2020; Dar et al., 2022). Namely, we examine the top tokens in the distribution over the vocabulary induced by the hidden state.
Tab. 3 shows the top tokens for three tasks for LLaMA 13B (more models and tasks are provided in Tab. 7 in §A). In multiple cases, we observe tokens that directly describe the task. Importantly, these terms never explicitly appeared in the context. For example, in the task of translation from French to English, we observe tokens such as "English" and "translate". This supports our view that θ carries significant, non-trivial semantic information about the task.
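A minimal sketch of this style of vocabulary projection, assuming access to the model's output-embedding (unembedding) matrix; the tiny vocabulary and matrix below are purely illustrative.

```python
def top_tokens(theta, unembed, vocab, k=10):
    # Project the hidden state theta through the output-embedding rows
    # (unembed[i] is the embedding of vocab[i]) and return the k tokens
    # with the highest logits -- a "logit lens"-style projection.
    logits = [sum(a * b for a, b in zip(row, theta)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: -logits[i])
    return [vocab[i] for i in ranked[:k]]
```

In the paper's experiments, applying this kind of projection to θ surfaces tokens describing the task (e.g., "English", "translate") even though those words never appear in the prompt.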

Related Work
Emergence of ICL A key question with ICL is how it emerges as a capability from pre-training the LLMs. Levine et al. (2022) provide results in this direction that highlight the importance of training data structure. Xie et al. use probabilistic analysis and model pre-training data using Hidden Markov Models to theoretically explain the emergence of ICL, while Chan et al. (2022) empirically explore the effect of several distributional properties of the pre-training data.

Limitations
We study relatively simple tasks, whereas ICL can learn to perform more complex tasks, such as solving arithmetic reasoning problems. It remains to be seen if and how the mechanisms we observe here will translate to these cases. For example, our approach focuses on cases where a single task vector suffices, while more complex ICL cases may require more elaborate parameterization. We also focus on tasks where the output is a single token, while other tasks require multi-token outputs.
Finally, as noted above, we do not provide a mechanistic explanation for how the task vector is formed or how it is used. Namely, we do not explain how the transformer performs these calculations using its parameters.

Figure 6: Conflicting tasks experiment. In the "Regular" scenario (top), the model is simply provided with demonstrations S_A for Task A (e.g., outputting the previous letter in the alphabet). In the "Conflicting" scenario (bottom), the model is still provided with demonstrations for Task A, but we inject a task vector θ(S_B) from a conflicting Task B (e.g., outputting the next letter in the alphabet).

Figure 1: ICL as learning in a Hypothesis Class. In ICL, one provides an LLM with a prompt including demonstrations S of some task, and a query x. The model generates the output for x (here "Yellow"). We show that the underlying process can be broken down into two parts: A, a "learning algorithm" (marked in blue), computes a query-agnostic vector θ(S), which we view as a parameter of a function in a hypothesis class. The second part, denoted by f and marked in yellow, is the application of the rule defined by θ on the query x, without direct dependence on S.

Figure 2: Separating A and f. To make θ independent of the query x, we use a dummy query (x′ = Plum) and use the representation of → at the Lth layer as θ. The vector θ is then patched at the same layer during a forward pass of a transformer that only takes x and → as input, to prevent the direct dependence of f on S.

Figure 3: Accuracy for each choice of the intermediate layer L, averaged across all tasks. Solid lines show average values, and shaded areas standard deviations.

Figure 4: Average accuracy across all tasks for each model, using each of the three procedures: Baseline, Regular, and Hypothesis.

Figure 5: A t-SNE plot of task vectors. A 2D t-SNE plot visualizing 50 task vectors for each task, each generated from a different choice of S and x′ using LLaMA 7B. Points are color-coded according to the task. Each task can be seen to form its own distinct cluster.

Figure 7: Accuracy for each choice of L (the intermediate layer where the task vector is injected), averaged across all tasks. The solid line represents the average value, and the shaded area depicts the standard deviation.

Figure 8: Task Vector Variability. For each task, two histograms are shown: (blue) the distribution of distances between different task vectors of this task, created from different S and x′; (orange) the distribution of distances between task vectors of this task and of other tasks.

Figure 9: A 2D t-SNE plot visualizing 50 task vectors for each task, each generated from a different choice of S and x′ using LLaMA 7B. Points are color-coded according to task category, such as algorithmic or translation. Each task can be seen to form its own distinct cluster. The labels provide the full name of the task in each cluster.

Table 1: A representative subset of the tasks used in the study, with input → output examples.

Table 2: Conflicting tasks experiment results. The model's accuracy on the relevant task (A in "Regular" and B in "Conflicting") is displayed for both scenarios.

Table 3: The top 10 tokens in the distribution induced by the task vector, for one task per category.

Table 5: The tasks used in the study, with input → output examples.