Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process, by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.


Introduction
How do transformer-based language models (LMs) construct predictions? We study this question through the lens of the feed-forward network (FFN) layers, one of the core components in transformers (Vaswani et al., 2017). Recent work showed that these layers play an important role in LMs, acting as memories that encode factual and linguistic knowledge (Geva et al., 2021; Da et al., 2021; Meng et al., 2022). In this work, we investigate how outputs from the FFN layers are utilized internally to build predictions.
We begin by making two observations with respect to the representation of a single token in the input, depicted in Fig. 1. First, the output of each FFN layer induces an additive update to the token representation (Fig. 1, A). Second, the token representation across the layers can be translated at any stage to a distribution over the output vocabulary (Geva et al., 2021) (Fig. 1, B). We reason that the additive component in the update changes this distribution (§2), namely, FFN layers compute updates that can be interpreted in terms of the output vocabulary.

Figure 1: Illustration of our findings. Feed-forward layers apply additive updates (A) to the token representation, which can be interpreted as a distribution over the vocabulary (B). An update is a set of sub-updates induced by parameter vectors (C), each of which can be interpreted as a concept in the vocabulary space (D).
We then decompose the FFN update (§3), interpreting it as a collection of sub-updates, each corresponding to a column in the second FFN matrix (Fig. 1, C) that scales the token probabilities in the output distribution. Through a series of experiments, we find that (a) sub-update vectors across the entire network often encode a small set of human-interpretable, well-defined concepts, e.g. "breakfast" or "pronouns" (§4, Fig. 1, D), and (b) FFN updates rely primarily on token promotion (rather than elimination), namely, tokens at the top of the output distribution are those pushed strongly enough by sub-updates (§5). Overall, these findings allow fine-grained interpretation of the FFN operation, providing a better understanding of the prediction construction process in LMs.
Beyond interpretation, our findings also have practical utility. In §6.1, we show how we can intervene in the prediction process, in order to manipulate the output distribution in a direction of our choice. Specifically, we show that increasing the weight of only 10 sub-updates in GPT2 reduces toxicity in its generations by almost 50%. Also, in §6.2, we show that dominant sub-updates provide a useful signal for predicting an early exit point, saving 20% of the computation on average, without changing the model's prediction.
In conclusion, we investigate the mechanism by which FFN layers update the inner representations of transformer-based LMs. We propose that the FFN output can be viewed as a collection of updates that promote concrete concepts in the embedding space, and that these concepts are often interpretable for humans. Our findings shed light on the prediction construction process in modern LMs, suggesting promising research directions for interpretability, control, and efficiency.

Token Representations as Evolving Distributions Over the Vocabulary

Modern LMs (Baevski and Auli, 2019; Radford et al., 2019; Brown et al., 2020) are transformer models primarily trained to predict the next-token probability for a given input. Such LMs are composed of interleaved multi-head self-attention (MHSA) layers and FFN layers (Vaswani et al., 2017), with residual connections (He et al., 2016) between each pair of consecutive layers. The LM prediction is obtained by projecting the output vector from the final layer through an embedding matrix E ∈ R^{|V|×d}, with hidden dimension d, to get a distribution over a vocabulary V (after softmax).
Given a sequence w = w_1, ..., w_t of input tokens, the model creates a contextualized representation x_i ∈ R^d for each token w_i ∈ w, which is updated throughout the layers. In this work, we analyze the updates applied by the FFN layers and how they construct the model prediction. Concretely, each FFN layer ℓ = 1, ..., L processes x_i^ℓ and produces an output o_i^ℓ, which is then added to x_i^ℓ to yield an updated representation x̃_i^ℓ:

x̃_i^ℓ = x_i^ℓ + o_i^ℓ.

The updated representation x̃_i^ℓ then goes through an MHSA layer,² yielding the input x_i^{ℓ+1} for the next FFN layer. The evolving representation in this process (i.e., x_i^ℓ → x̃_i^ℓ, ∀ℓ) can be viewed as an information stream that is processed and updated by the layers (Elhage et al., 2021). The output probability distribution is obtained from the final representation of the token, i.e.:

y = softmax(E x̃_i^L).   (1)

To analyze the FFN updates, we read from the representation at any layer a distribution over the output vocabulary, by applying the same projection as in Eq. 1 (Geva et al., 2021):

p_i^ℓ = softmax(E x_i^ℓ),   p̃_i^ℓ = softmax(E x̃_i^ℓ).

Note that p̃_i^L = y. Importantly, by linearity:

E x̃_i^ℓ = E x_i^ℓ + E o_i^ℓ,

implying that o_i^ℓ can be interpreted as an additive update in the vocabulary space. However, we find that the projection of the FFN output, E o_i^ℓ, to the vocabulary is not interpretable (§4). In this work, we take this a step further, and decompose the update o_i^ℓ into a set of smaller sub-updates. By projecting the sub-updates to the vocabulary, we find that they often express human-interpretable concepts.
In the rest of the paper, we focus on FFN updates to the representation of a single token in the sequence. For brevity, we omit the token index, i.e., x^ℓ := x_i^ℓ and p^ℓ := p_i^ℓ.
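The translation in Eq. 1, and the linearity of the update, are easy to verify numerically. A minimal numpy sketch with toy dimensions (E, x, and o are random stand-ins for the real embedding matrix and hidden states):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 50, 16                # toy vocabulary size and hidden dimension
E = rng.normal(size=(V, d))  # stand-in embedding matrix
x = rng.normal(size=d)       # representation before the FFN update
o = rng.normal(size=d)       # FFN output at the same layer

p = softmax(E @ x)               # distribution read before the update
p_tilde = softmax(E @ (x + o))   # distribution read after the update

# By linearity, the post-update logits split into the two projections:
assert np.allclose(E @ (x + o), E @ x + E @ o)
```

The same projection E can thus be applied at any layer, which is what lets us read intermediate representations as distributions.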

The FFN Output as a Collection of Updates to the Output Distribution
We now decompose the FFN output, and interpret it as a set of sub-updates in the vocabulary space.

FFN Outputs as Linear Vector Combinations.
Each FFN at layer ℓ consists of two linear transformations with a point-wise activation function in between (bias terms are omitted):

FFN^ℓ(x^ℓ) = W_V^ℓ f(W_K^ℓ x^ℓ),

where W_K^ℓ ∈ R^{d_m×d} and W_V^ℓ ∈ R^{d×d_m} are parameter matrices, and f is a non-linearity function. Previous work proposed that this module can be cast as an emulated neural key-value memory (Sukhbaatar et al., 2015, 2019), where rows in W_K^ℓ and columns in W_V^ℓ are viewed as keys and values, respectively. For an input x^ℓ, the keys produce a vector of activation coefficients m^ℓ := f(W_K^ℓ x^ℓ) ∈ R^{d_m}, which weighs the corresponding values in W_V^ℓ. Denoting by k_i^ℓ the i-th row of W_K^ℓ and by v_i^ℓ the i-th column of W_V^ℓ, we can then use the following formulation:

FFN^ℓ(x^ℓ) = Σ_{i=1}^{d_m} f(x^ℓ · k_i^ℓ) v_i^ℓ = Σ_{i=1}^{d_m} m_i^ℓ v_i^ℓ.

Therefore, an FFN update can be viewed as a weighted collection of sub-updates, each corresponding to a value vector in the FFN output.

Interpreting Sub-Updates in the Vocabulary Space. Consider a sub-update m_i^ℓ v_i^ℓ for a given input; we can estimate its influence on the representation x^ℓ (before the FFN update) by analyzing the change it induces on the output distribution. Concretely, we isolate the effect of m_i^ℓ v_i^ℓ on the probability of a token w ∈ V:³

p(w | x^ℓ + m_i^ℓ v_i^ℓ) = exp(e_w · x^ℓ) · exp(e_w · m_i^ℓ v_i^ℓ) / Z,

where e_w is the embedding of w, and Z is the constant softmax normalization factor. This implies that each sub-update m_i^ℓ v_i^ℓ introduces a scaling factor exp(e_w · m_i^ℓ v_i^ℓ) to the probability of every token w, based on its dot product with e_w. Specifically, having e_w · m_i^ℓ v_i^ℓ > 0 increases the probability of w, and having e_w · m_i^ℓ v_i^ℓ < 0 decreases it. This scaling factor can be split into two parts:

• The term e_w · v_i^ℓ can be viewed as a static score of w that is independent of the input to the model. Thus, the projection r_i^ℓ = Ev_i^ℓ ∈ R^{|V|} induces a ranking over the vocabulary that allows comparing the scores given by v_i^ℓ to different tokens.
• The term m_i^ℓ is the dynamic coefficient of v_i^ℓ, which is fixed across all tokens for a given input. Thus, these coefficients allow comparing the contributions of different value vectors within a specific update.
Overall, the scaling factor e_w · m_i^ℓ v_i^ℓ can be viewed as the effective score given by a value vector v_i^ℓ to a token w for a given input.
In the next sections, we use these observations to answer two research questions: (a) what information is encoded in sub-updates, and what tokens do they promote? (§4); and (b) how do FFN updates build the output probability distribution? (§5)

Terminology. In the rest of the paper, we refer to the vectors v_i^ℓ as value vectors, and to their weighted form m_i^ℓ v_i^ℓ as sub-updates. A transformer LM with L = 10 and d_m = 3,000 will have 30,000 value vectors, and every token that passes through the transformer will weight these value vectors differently, resulting in 30,000 sub-updates, of which only a few have high weights.
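The key-value decomposition above can be checked in a few lines. A toy numpy sketch (random matrices stand in for W_K and W_V; ReLU stands in for the non-linearity f):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_m = 16, 64
W_K = rng.normal(size=(d_m, d))  # rows are keys
W_V = rng.normal(size=(d, d_m))  # columns are values
x = rng.normal(size=d)

relu = lambda z: np.maximum(z, 0.0)  # stand-in for the non-linearity f
m = relu(W_K @ x)                    # activation coefficients, one per value

ffn_out = W_V @ m                    # the full FFN output ...
sub_updates = [m[i] * W_V[:, i] for i in range(d_m)]  # ... as m_i * v_i terms
assert np.allclose(ffn_out, np.sum(sub_updates, axis=0))
```

The decomposition is exact: the FFN output is literally the sum of its sub-updates, so each sub-update can be analyzed in isolation.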

Sub-Updates Encode Concepts in the Embedding Space
We evaluate whether projection to the vocabulary provides a meaningful way to "read" FFN updates, and the extent to which sub-updates are interpretable based on their projections. To this end, we manually inspect the top-scoring tokens of real and randomly generated value vectors and check whether they express interpretable concepts. Concretely, we consider two representative LMs (details below), and for each value vector v_i^ℓ we compute a ranking over the vocabulary by sorting the projection r_i^ℓ = Ev_i^ℓ (§3). Then, we try to detect patterns in the top-scoring tokens of each value vector.
Concepts Annotation Task. We let experts (NLP graduate students) annotate concepts by identifying common patterns among the top-30 scoring tokens of each value vector. For a set of tokens, the annotation protocol includes three steps: (a) identifying patterns that occur in at least 4 tokens, (b) describing each recognized pattern, and (c) classifying each pattern as either "semantic" (e.g., mammals), "syntactic" (e.g., past-tense verbs), or "names". The last class was added only for WIKILM (see below), following the observation that a large portion of the model's vocabulary consists of names. The complete instructions and a fully annotated example can be found in App. A.2.

Models. We consider two representative LMs: WIKILM (Baevski and Auli, 2019) and GPT2 (Radford et al., 2019). GPT2 uses the GeLU activation (Hendrycks and Gimpel, 2016), while WIKILM uses ReLU, and in contrast to GPT2, WIKILM does not apply layer normalization after FFN updates. WIKILM defines d = 1024, d_m = 4096 and GPT2 defines d = 768, d_m = 3072, resulting in a total of 65k and 36k value vectors, respectively. For our experiments, we sample 10 random value vectors per layer from each model, yielding a total of 160 and 120 vectors to analyze from WIKILM and GPT2, respectively.
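Mechanically, producing the token sets shown to annotators is a projection followed by an argsort. A sketch with a toy vocabulary (all names and sizes here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = [f"tok{i}" for i in range(500)]  # toy vocabulary
E = rng.normal(size=(len(vocab), 8))     # stand-in embedding matrix
v = rng.normal(size=8)                   # stand-in value vector

r = E @ v                                # static scores over the vocabulary
top30 = [vocab[j] for j in np.argsort(-r)[:30]]  # tokens shown to annotators
```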

Projection of Sub-Updates is Meaningful
Real vs. Random Sub-Updates. We validate our approach by comparing the projections of value vectors to those of 10 random vectors, initialized from a normal distribution with the empirical mean and standard deviation of the real vectors. We observe that a substantially higher portion of top tokens were associated with a concept in value vectors compared to the random ones: 55.1% vs. 22.7% in WIKILM, and 37% vs. 16% in GPT2. Also, in both models, the average number of concepts per vector was > 1 for the value vectors compared to ∼0.5 for the random ones. Notably, no semantic or syntactic concepts were identified in the random vectors of WIKILM, and in GPT2, only 4% of the tokens were marked as semantic concepts in the random vectors, versus 24.9% in the value vectors.
Updates vs. Sub-Updates. To justify the FFN output decomposition, we also analyze the projections of 10 random FFN outputs per layer. In WIKILM (GPT2), 39.4% (46%) of the tokens were associated with concepts, but for 19.7% (34.2%) the concept was "stopwords/punctuation". Moreover, we observe very few concepts (< 4%) in the last two layers of WIKILM. We attribute this to extreme sub-updates that dominate the layer's output (§5.2). Excluding these concepts results in a considerably lower token coverage for updates' projections compared to sub-updates': 19.7% vs. 55.1% in WIKILM, and 11.8% vs. 36.7% in GPT2.
Overall, this shows that projecting sub-updates to the vocabulary provides a meaningful interface to the information they encode. Moreover, decomposing the FFN outputs is necessary for fine-grained interpretation of sub-updates.

Sub-Update Projections are Interpretable

Fig. 2 shows a breakdown of the annotations across layers, for WIKILM and GPT2. In both models and across all layers, a substantial portion (40%-70% in WIKILM and 20%-65% in GPT2) of the tokens were associated with well-defined concepts, most of which were classified as "semantic". Also, we observe that the top-scoring tokens of a single value vector were associated with 1.5 (WIKILM) and 1.1 (GPT2) concepts on average, showing that sub-updates across all layers encode a small set of well-defined concepts. Examples are in Tab. 1.

These findings expand on previous results by Geva et al. (2021), who observed that value vectors in the upper layers represent next-token distributions that follow specific patterns. Our results, which hold across all layers, suggest that these vectors represent general concepts rather than prioritizing specific tokens.
Underestimation of Concept Frequency. In practice, we find that this task is hard for humans, as tokens often correspond to uncommon words, homonyms, or sub-words, and some patterns necessitate world knowledge (e.g. recognizing "villages in Europe near rivers") or linguistic background (e.g. identifying negative polarity items). This often leads to time-consuming searches over the Web 4 and more importantly, undetectable patterns, suggesting that the overall results are an underestimation of the true concept frequency. Providing additional context and token-related information are possible directions for improving the annotation protocol, which we leave for future work.
Implication for Controlled Generation. If sub-updates indeed encode concepts, then we can not only interpret their contribution to the prediction, but also intervene in this process, by increasing the weights of value vectors that promote tendencies of our choice. We demonstrate this in §6.1.

FFN Updates Promote Tokens in the Output Distribution
We showed that sub-updates often encode interpretable concepts ( §4), but how do these concepts construct the output distribution? In this section, we show that sub-updates systematically configure the prediction through promotion and saturation of candidate tokens.
Table 2: Example saturation and elimination events, showing candidate rankings before (p^ℓ) and after (p̃^ℓ) an FFN update.
• Saturation (top): p^ℓ: cow, cat, dog, goat, horse, bear → p̃^ℓ: dog, cat, goat, horse, cow, bear. Here dog is promoted from rank 3 in p^ℓ to rank 1 in p̃^ℓ, and remains the top candidate until the last layer.
• Elimination (bottom): p^ℓ: cow, cat, dog, goat, horse, bear → p̃^ℓ: dog, cat, goat, horse, cow, bear. Here cow is eliminated from rank 1 in p^ℓ to rank 5 in p̃^ℓ.

Table 3: Maximum, mean, and minimum scores of reference tokens in saturation and elimination events, by the 10 most dominant and 10 random sub-updates.

Promoted Versus Eliminated Candidates
Every sub-update m_i^ℓ v_i^ℓ either increases, decreases, or does not change the probability of a token w, according to the score e_w · m_i^ℓ v_i^ℓ (§3). This suggests three mechanisms by which tokens are pushed to the top of the output distribution: promotion, where sub-updates increase the probability of favorable tokens; elimination, where sub-updates decrease candidate probabilities; or a mixture of both.
To test which mechanism holds in practice, we analyze the scores that sub-updates assign to the top-candidate tokens of the representation. To simplify the analysis, we focus on changes induced by the 10 most dominant sub-updates in each layer, that is, the 10 sub-updates m_i^ℓ v_i^ℓ with the largest contribution to the representation, as measured by the norm ‖m_i^ℓ v_i^ℓ‖ = |m_i^ℓ| · ‖v_i^ℓ‖. For the experiments, we use a random sample of 2,000 examples from the validation set of WIKITEXT-103,⁵ which neither WIKILM nor GPT2 observed during training. As the experiments do not involve human annotations, we use a larger GPT2 model with L = 24, d = 1024, d_m = 4096.
We start by comparing the scores of a reference token in the context of two types of events:

• Saturation (Tab. 2, top): The update p^ℓ → p̃^ℓ where the final token predicted by the model (i.e., w = argmax(y)) was promoted to be the top candidate, and remained so until the last layer. We analyze saturation events induced by the FFN before the last layer, covering 1,184 and 1,579 events in WIKILM and GPT2, respectively.

• Elimination (Tab. 2, bottom): The update p^ℓ → p̃^ℓ with the largest increase in the top candidate's rank, i.e., where the top candidate was dropped behind other candidates to a rank > 1. Overall, our analysis covers 1,909 (WIKILM) and 1,996 (GPT2) elimination events.
We compute the mean, maximum, and minimum scores of the reference token by the 10 most dominant sub-updates in each event, and average over all the events. As a baseline, we compute the scores by 10 random sub-updates from the same layer.
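The per-event statistics can be computed as follows. A toy numpy sketch, where dominance is measured by the norm of each sub-update and all tensors are random stand-ins for the real model quantities:

```python
import numpy as np

rng = np.random.default_rng(3)
d, d_m = 16, 64
W_V = rng.normal(size=(d, d_m))
m = np.maximum(rng.normal(size=d_m), 0.0)  # toy activation coefficients
e_w = rng.normal(size=d)                   # embedding of the reference token

sub = m * W_V                              # column i holds the sub-update m_i v_i
dominance = np.linalg.norm(sub, axis=0)    # ||m_i v_i|| for each sub-update
top10 = np.argsort(-dominance)[:10]        # the 10 most dominant sub-updates

scores = e_w @ sub[:, top10]               # e_w · (m_i v_i) for each of them
stats = (scores.max(), scores.mean(), scores.min())
```

Averaging `stats` over all events, separately for saturation and elimination, yields the entries reported in Tab. 3.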
Tab. 3 shows the results. In both models, tokens promoted to the top of the distribution receive higher maximum scores than tokens eliminated from the top position (1.2 vs. 0.5 in WIKILM and 8.5 vs. 4.0 in GPT2), indicating they are pushed strongly by a few dominant sub-updates. Moreover, tokens eliminated from the top of the distribution receive near-zero mean scores, both by dominant and random sub-updates, suggesting they are not eliminated directly. In contrast to promoted tokens, where the maximum scores are substantially higher than the minimum scores (1.2 vs. −0.8 in WIKILM and 8.5 vs. −4.9 in GPT2), for eliminated tokens the scores are similar in magnitude (±0.5 in WIKILM and 4.0 vs. −3.6 in GPT2). Last, scores by random sub-updates are dramatically lower in magnitude, showing that our choice of sub-updates is meaningful and that higher coefficients translate to greater influence on the output distribution.

Table 4: Value vector groups in WIKILM and GPT2 promoting common ("C") and unlikely ("U") tokens. For each group, we show its size, the layers covered by > 70% of its vectors, and example tokens it promotes.

This suggests that FFN updates work in a promotion mechanism, where top-candidate tokens are those being pushed by dominant sub-updates.

Value Updates Across Layers
To analyze the FFN operation in different layers, we break down the top-candidate scores per layer. Formally, let w = argmax(p^ℓ) be the top candidate at layer ℓ (before the FFN update) for a given input; we extract the scores e_w · m_i^ℓ v_i^ℓ of the 10 most dominant sub-updates and compute the mean, minimum, and maximum scores over that set. Fig. 3 shows that, in both models, until the last few layers (23-24 in GPT2 and 14-16 in WIKILM), maximum and minimum scores are distributed around non-negative mean scores, with prominent peaks in maximum scores (layers 3-5 in GPT2 and layers 4-11 in WIKILM). This suggests that the token promotion mechanism generally holds across layers. However, scores diverge in the last layers of both models, with strongly negative minimum scores, indicating that the probability of the top candidate is pushed down by dominant sub-updates. We next show that these large deviations in positive and negative scores (Fig. 3, dashed lines) result from the operation of small sets of functional value vectors.
Extreme Sub-Updates. To analyze the extreme FFN updates, we first cluster the value vectors, in order to discover high-level trends. To this end, we use agglomerative clustering (Müllner, 2011) to learn k = 10,000 clusters for each model (we experimented with k = 3e2, 1e3, 3e3, 1e4, 3e4, and chose k = 1e4 based on manual inspection), based on the cosine distance matrix D, where

D_{(ℓ1,i),(ℓ2,j)} = cosine-distance(v_i^{ℓ1}, v_j^{ℓ2}), ∀ i, j ∈ {1, · · · , d_m}, ∀ ℓ1, ℓ2 ∈ {1, · · · , L}.

Then, we search for clusters that are frequently active in extreme updates, by (a) extracting sub-updates where the scores for the top candidate pass a certain threshold (±10 for GPT2 and ±5 for WIKILM), and (b) counting the appearances of each cluster in the layer sub-updates.
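A sketch of this clustering step, using scipy's hierarchical clustering as a stand-in implementation (toy sizes; the paper clusters all value vectors across layers into 10,000 clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(4)
n_values, d, k = 200, 16, 20             # toy counts; the paper uses k = 10,000
values = rng.normal(size=(n_values, d))  # value vectors stacked across layers

# Agglomerative (average-linkage) clustering under cosine distance.
Z = linkage(values, method="average", metric="cosine")
labels = fcluster(Z, t=k, criterion="maxclust")  # cluster id per value vector
```

Each value vector then carries a cluster label, so counting which clusters fire in extreme updates reduces to a histogram over `labels`.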
In both models, a small set of homogeneous clusters is correlated with the extreme sub-updates, divided into two main groups of value vectors (Tab. 4): Vectors in the upper layers that promote generally unlikely tokens (e.g. rare tokens), and vectors that are spread over all the layers and promote common tokens (e.g. stopwords).
These clusters, which cover only a small fraction of the value vectors (1.7% in GPT2 and 1.1% in WIKILM), are jointly active in FFN updates (χ² test with p-value 0.001) and account for the extreme scores shown in Fig. 3. Moreover, they are mostly active for examples where the input sequence has ≤ 3 tokens, or when the target token can be easily inferred from the context (e.g. an end-of-sentence period), suggesting that these value vectors might configure "easy" model predictions. More interestingly, the value vectors that promote unlikely tokens can be viewed as "saturation vectors", which propagate the distribution without changing the top tokens. Indeed, these vectors reside in the last layers, where the model often already stores its final prediction (Geva et al., 2021).

Zero-Shot Toxic Language Suppression
LMs are known to generate toxic, harmful language, which damages their usefulness (Bender et al., 2021;McGuffie and Newhouse, 2020;Wallace et al., 2019). We utilize our understanding of the FFN layers in LMs to create a simple, intuitive method for toxic language suppression.
Method. If transformers indeed operate in a promotion mechanism, we reason that we can decrease model toxicity by "turning on" non-toxic sub-updates. We find value vectors that promote safe, harmless concepts by extracting the top tokens in the projections of all the value vectors and either (a) manually searching for vectors that express a coherent set of positive words (e.g. "safe" and "thank"), or (b) grading the tokens with the Perspective API⁸ and selecting non-toxic value vectors (see details in App. A.4). We turn on these value vectors by setting their coefficients to 3, a relatively high value according to Fig. 3. We compare our method with the following baselines:

1. Self-Debiasing (SD) (Schick et al., 2021): SD generates a list of undesired words for a given prompt by appending a self-debiasing input, which encourages toxic completions, and calculating which tokens are promoted compared to the original prompt. The probabilities of these undesired words are then decreased according to a decay constant λ, which we set to 50 (the default).
2. WORDFILTER: We prevent GPT2 from generating words from a list of banned words by setting any logits that would result in a banned word completion to −∞ (Gehman et al., 2020).
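Mechanically, the intervention in our method only overrides a handful of activation coefficients before the second FFN matrix is applied. A framework-agnostic numpy sketch (the indices in `chosen` are hypothetical; in a real model this would be implemented as a forward hook on the FFN):

```python
import numpy as np

rng = np.random.default_rng(5)
d, d_m = 16, 64
W_V = rng.normal(size=(d, d_m))            # stand-in second FFN matrix
m = np.maximum(rng.normal(size=d_m), 0.0)  # coefficients computed from the input

chosen = [3, 17, 42]   # hypothetical indices of non-toxic value vectors
m_edit = m.copy()
m_edit[chosen] = 3.0   # force a relatively high coefficient (cf. Fig. 3)

out = W_V @ m_edit     # FFN output with the intervention applied
```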
Evaluation. We evaluate our method on the challenging subset of REALTOXICPROMPTS (Gehman et al., 2020), a collection of 1,225 prompts that tend to yield extremely toxic completions in LMs, using the Perspective API, which grades text according to six toxicity attributes. A score > 0.5 indicates that the text is considered toxic w.r.t. that attribute. Following Schick et al. (2021), we generate continuations of 20 tokens with beam search of beam size 3. Additionally, we compute perplexity to account for changes in LM performance.
Results. Finding the non-toxic sub-updates manually was intuitive and efficient (taking < 5 minutes). Tab. 5 shows that activating only 10 value vectors (0.01%) substantially decreases toxicity, outperforming both SD and WORDFILTER. Toxicity dropped by 47% with our 10 manually picked vectors, compared to 37% with SD. However, our method resulted in a perplexity increase greater than that induced by SD, though the increase was still relatively small.

Self-Supervised Early Exit Prediction
The recent success of transformer-based LMs in NLP tasks has resulted in major production cost increases (Schwartz et al., 2020a), and thus has spurred interest in early-exit methods that reduce the incurred costs (Xu et al., 2021). Such methods often use small neural models to determine when to stop the execution process (Schwartz et al., 2020b; Elbayad et al., 2020; Hou et al., 2020; Xin et al., 2020, 2021; Li et al., 2021; Schuster et al., 2021). We harness our findings to create a simple and effective early-exit method for LM inference, which does not involve any external model training. We posit that the dominant FFN sub-updates can indicate whether a saturation event (§5.2) occurs. We test this on WIKILM, where saturation events occur across all layers (statistics for WIKILM and GPT2 are in App. A.5).
Method. We devise a simple prediction rule based on a nearest-neighbours approach, using 10k validation examples from WIKITEXT-103. First, for every example, we map the top-10 dominant sub-updates at each layer to their corresponding clusters. Then, for every layer ℓ, we split all the sets of clusters at that layer into two sets, T^ℓ and N^ℓ, based on whether saturation occurred or not (e.g., T^5 stores all the cluster sets that were active in a saturation event at layer 5). Given the top-10 clusters of an unseen example at some layer ℓ, we consider a higher overlap with T^{ℓ'} than with N^{ℓ'}, ∀ ℓ' > ℓ, as a signal for early exit. Thus, during inference, we propagate the input example through the layers, and compute at each layer the intersection size between its top-10 active clusters and each of T^{ℓ'} and N^{ℓ'}, ∀ ℓ' > ℓ. If the average and maximal intersection with T exceed those with N, ∀ ℓ' > ℓ, we halt the computation and declare early exit.⁹

Baselines. To highlight the utility of sub-updates for early exit, we train layer-wise binary classifiers over the representations and FFN updates x^ℓ, o^ℓ, and x̃^ℓ, using logistic regression. As in our method, the labels are determined according to saturation events in the training data (see App. A.5). During inference, we execute the computation through the layers, and halt according to the layer classifier.
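The exit rule from our method above can be sketched as a small pure-Python function (the data layout — per-layer lists of cluster sets, pooled over all deeper layers — is an illustrative simplification):

```python
def should_exit(active, T_sets, N_sets, layer):
    """Signal early exit at `layer` if the currently active clusters overlap
    more with cluster sets from saturation events (T) than from non-saturation
    events (N), pooled over all deeper layers, both on average and at the max."""
    t_sizes, n_sizes = [], []
    for l in range(layer + 1, len(T_sets)):
        t_sizes += [len(active & s) for s in T_sets[l]]
        n_sizes += [len(active & s) for s in N_sets[l]]
    if not t_sizes or not n_sizes:
        return False
    avg_ok = sum(t_sizes) / len(t_sizes) > sum(n_sizes) / len(n_sizes)
    return avg_ok and max(t_sizes) > max(n_sizes)

# Toy usage: clusters {1, 2, 3} dominate saturation events at deeper layers.
T = [[], [frozenset({1, 2, 3})], [frozenset({1, 2})]]
N = [[], [frozenset({7, 8})], [frozenset({9})]]
print(should_exit(frozenset({1, 2, 5}), T, N, layer=0))  # → True
```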
Evaluation. We evaluate each method by accuracy, i.e., the portion of examples for which exiting at the predicted layer yields the final model prediction, and by computation efficiency, measured by the number of layers saved for examples with a correct prediction. We run each method with five random seeds and report the average scores.
Results. Tab. 6 shows that our method exhibits the highest accuracy, 94.1%, demonstrating the utility of sub-updates for predicting saturation events. Moreover, just by observing the dominant sub-updates, it saves 20% of the computation on average, without changing the prediction. In addition, early-exit prediction based on the FFN outputs (o^ℓ) yielded the largest computation reduction, 28.7%, surpassing the prediction rules relying on the representations (x^ℓ, x̃^ℓ) by ∼7%. This further supports our hypothesis that FFN updates play a functional role in saturating the prediction (§5.2).

Related Work
The lack of interpretability of modern LMs has led to wide interest in understanding their prediction construction process. Previous works mostly focused on analyzing the evolution of hidden representations across layers (Voita et al., 2019), and probing the model with target tasks (Yang et al., 2020; Clark et al., 2019; Tenney et al., 2019; Saphra and Lopez, 2019). In contrast, our approach aims to interpret the model parameters and their utilization in the prediction process. More recently, a surge of works has investigated the knowledge captured by the FFN layers (Da et al., 2021; Jiang et al., 2020; Dai et al., 2021; Yao et al., 2022; Meng et al., 2022; Wallat et al., 2020). These works show that the FFN layers store various types of knowledge, which can be located in specific neurons and edited. Unlike these works, we focus on the FFN outputs and their contribution to the prediction construction process.
Last, our interpretation of FFN outputs as updates to the output distribution relates to recent works that interpreted groups of LM parameters in the discrete vocabulary space (Geva et al., 2021;Khashabi et al., 2021), or viewed the representation as an information stream (Elhage et al., 2021).

Discussion
Conclusions. Understanding the inner workings of transformers is valuable for explainability to end-users, for debugging predictions, for eliminating undesirable behavior, and for understanding the strengths and limitations of NLP models. The FFN is an understudied core component of transformer-based LMs, which we focus on in this work.
We study the FFN output as a linear combination of parameter vectors, termed values, and the mechanism by which these vectors update the token representations. We show that value vectors often encode human-interpretable concepts and that these concepts are promoted in the output distribution.
Our analysis of the internals of auto-regressive transformers provides a more detailed understanding of how transformer-based LMs make predictions, and suggests new research directions of model interpretability, control, and efficiency, at the level of individual vectors.
Future Work. Our study focused on the role that individual value vectors play in forming model predictions. Future work should study the interplay between these vectors and other components in the network, such as interactions between value vectors and attention heads, or among multiple value vectors. Another research avenue that follows from our work is investigating how sub-updates integrate factual knowledge from the FFN layers into model predictions. Finally, our annotation effort was made to evaluate our hypothesis that sub-updates encode human-interpretable concepts. Scaling our annotation protocol would enable a more refined map of the concepts, knowledge, and structure captured by LMs.

Ethical Considerations
Our work on understanding the role that single value vectors play in the inference performed by transformer-based LMs potentially improves their transparency, while also providing useful control applications that save energy (early-exit prediction) and increase model harmlessness (toxic language suppression). It should be made clear that our method for toxic language suppression only reduces the probability of toxic language generation and does not eliminate it. As such, this method (as well as our early-exit method) should not be deployed in the real world without further work and caution.

A.1 Value Vectors Projection Method
Our interpretation method for sub-updates is based on directly projecting value vectors through the embedding matrix, i.e., for a value v and embedding matrix E, we calculate Ev (§4). However, in some LMs, such as GPT2, the value vectors in each layer are added to the token representation followed by layer normalization (LN) (Ba et al., 2016). This raises the question of whether "reading" vectors that are normalized in the same manner as the representation would yield different concepts.
To test this, we compare the top-30 scoring tokens by Ev_i^ℓ and by E · LayerNorm(v_i^ℓ), for i = 1, ..., d_m and ℓ = 1, ..., L, using Intersection over Union (IoU). As a baseline, we also compare Ev_i^ℓ with random vectors, initialized from a normal distribution with the empirical mean and standard deviation of the value vectors. Fig. 4 shows that LN does not change the projection substantially, with an overlap of 64.5% of the top-30 tokens on average, suggesting that the same concepts are promoted in both cases. This is in contrast to the random vectors, which produce a ∼0% overlap on average.
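The comparison can be sketched as follows (a toy numpy version; it ignores LN's learned gain and bias, which is a simplifying assumption):

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)  # no learned scale/bias

def top30_iou(E, v):
    a = set(np.argsort(-(E @ v))[:30])              # top-30 of the raw projection
    b = set(np.argsort(-(E @ layer_norm(v)))[:30])  # top-30 after LN
    return len(a & b) / len(a | b)                  # Intersection over Union

rng = np.random.default_rng(6)
E = rng.normal(size=(500, 32))  # stand-in embedding matrix
v = rng.normal(size=32)         # stand-in value vector
iou = top30_iou(E, v)
```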

A.2 Concepts Annotation
We analyze the concepts encoded in sub-updates, by projecting their corresponding value vectors to the embedding matrix and identifying repeating patterns in the top-30 tokens. Pattern identification was performed by experts (NLP graduate students), following the instructions presented in Tab. 5.
For value vectors in WIKILM, which uses a word-level vocabulary with many uncommon words, we additionally attached a short description field to each token that provides context about the meaning of the word. For the description of a token w, we first try to extract the definition of w from WordNet. If w does not exist in WordNet, as often happens for names of people and places, we then search for w in Wikipedia and extract a short (possibly noisy) description if the query was successful. A complete annotation example is provided in Tab. 7.

A.3 Sub-Update Contribution in FFN Outputs
In this section, we justify our choice throughout the paper of looking at the top-10 dominant sub-updates. The contribution of a sub-update m_i^ℓ v_i^ℓ to the FFN output is

|m_i^ℓ| · ||v_i^ℓ|| / Σ_{j=1}^{d_m} |m_j^ℓ| · ||v_j^ℓ||,

namely, its relative weight compared to the overall sum of weights of all sub-updates. The overall contribution of the top-10 dominant sub-updates is computed by summing their contributions. Note that we take the absolute value of the coefficients |m_i^ℓ|, since some activation functions (e.g., GeLU (Hendrycks and Gimpel, 2016) in GPT2) can result in negative values of m_i^ℓ. Empirically, we observe that in some cases sub-updates with negative coefficients do appear among the 10 most dominant sub-updates in GPT2. We further attribute this to the success of GeLU in transformer models (Shazeer, 2020), as it increases the expressiveness of the model by allowing the scores that value vectors induce over the vocabulary to be reversed.

Fig. 6 depicts the contribution of the top-10 dominant sub-updates per layer for WIKILM and GPT2, using 2000 random examples from the WIKITEXT-103 validation set. Clearly, for all layers, the contribution of the dominant sub-updates exceeds that of random sub-updates. Observe that, even though they cover only 0.24% of the value vectors, the contribution of the dominant sub-updates is typically around 5%, and in some layers (e.g., layers 8-16 in WIKILM and layer 1 in GPT2) it reaches over 10% of the total contribution. This demonstrates that analyzing the top-10 dominant sub-updates can shed light on the way predictions are built through the layers.
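A minimal sketch of this computation, with toy dimensions standing in for d_m and the hidden size:

```python
import numpy as np

def top10_contribution(m, V):
    """Summed relative weight of the 10 most dominant sub-updates m_i * v_i.

    The weight of sub-update i is |m_i| * ||v_i||; its contribution is this
    weight normalized by the total weight over all d_m sub-updates.
    """
    weights = np.abs(m) * np.linalg.norm(V, axis=1)
    contributions = weights / weights.sum()
    dominant = np.argsort(contributions)[-10:]  # top-10 dominant sub-updates
    return contributions[dominant].sum()

rng = np.random.default_rng(0)
m = rng.normal(size=4096)          # toy FFN activation coefficients
V = rng.normal(size=(4096, 1024))  # toy value vectors, one per row
share = top10_contribution(m, V)   # fraction of total weight, in (0, 1]
```

With real model parameters, averaging `share` over a corpus yields the per-layer curves in Fig. 6.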

Table 5: Instructions for the pattern identification task.

In this task, you are given a list of 30 words in English, and the goal is to identify repetitive patterns occurring in the words.
Patterns can be semantic (e.g. animals, 3-digit numbers, names of Indian actors, and time-related words) or syntactic (e.g. connectives, plurals, words starting with "dis-", and verbs in present progressive tense). You should only count patterns that occur in at least 4 words (i.e. if you notice a pattern that occurs only in 3 words, then please ignore it).
To complete the task, please do the following:
1. Give an ID to every identified pattern (1, 2, ...).
2. Assign a pattern ID to every word in the list, or -1/leave empty if no pattern applies to the word.
3. For every identified pattern, specify whether the pattern is semantic or syntactic and (optionally) write a short description of the pattern.
Please note that some of the words might be uncommon words that you are not familiar with. In such cases, you will need to do a quick search over the Web to understand the meaning of words.

A.4 Toxic Language Suppression Details
The 10 manually selected value vectors were found by searching for non-toxic words, such as "safe" and "peace", among the top-30 tokens in the vectors' projections to the vocabulary. We selected a small set of 10 value vectors whose top-scoring tokens were coherent and seemed to promote different kinds of non-toxic tokens. The list of manually picked vectors is provided in Tab. 8. Importantly, the search process was a one-time effort that took < 5 minutes in total. We chose the value vectors in a greedy manner, without additional attempts to optimize our choice.
To select 10 non-toxic value vectors based on an automatic toxicity metric, we used the Perspective API. Concretely, we concatenated the top-30 tokens by each value vector and graded the resulting text with the toxicity score produced by the API. Then, we sampled 10 random vectors with a toxicity score < 0.1 (a score < 0.5 indicates non-toxic text).
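This selection procedure can be sketched as follows; `toxicity_score` is a hypothetical stand-in for the Perspective API call, which is not reproduced here:

```python
import random

def select_nontoxic_vectors(top_tokens_per_vector, toxicity_score,
                            n=10, threshold=0.1, seed=0):
    """Sample n value-vector indices whose top-30 tokens score as non-toxic.

    `top_tokens_per_vector` maps each value-vector index to its top-30
    projected tokens; `toxicity_score` is a placeholder for the Perspective
    API (returns a score in [0, 1], where < 0.5 indicates non-toxic text).
    """
    candidates = [
        i for i, tokens in enumerate(top_tokens_per_vector)
        if toxicity_score(" ".join(tokens)) < threshold
    ]
    return random.Random(seed).sample(candidates, n)

# Usage with a dummy scorer that flags a single marker token.
dummy_score = lambda text: 0.9 if "TOXIC" in text else 0.0
projections = [["TOXIC", "word"]] * 5 + [["safe", "peace"]] * 20
picked = select_nontoxic_vectors(projections, dummy_score)
```

The dummy scorer and toy projections are illustrative only; in the paper the scores come from the Perspective API over the real vocabulary projections.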

A.5 Early Exit Details
This section provides further details and analysis regarding our early exit method and the baselines we implemented.
Method Implementation. We consider 90% of the 10k examples for constructing T and N, and the remaining 10% of the examples as the test set. We used k = 200 to cluster the top-10 dominant value vectors, but observed that other k values yielded similar results.
Baselines' Implementation. We train each binary classifier using 8k training examples, based on the standardized forms of each feature vector. We ran a hyperparameter sweep, using 8-fold cross-validation, with l1 or l2 regularization (lasso (Tibshirani, 1996) or ridge (Hoerl and Kennard, 1970)), regularization coefficients C ∈ {1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3}, and took the best performing model for each layer. We also used an inversely proportional loss coefficient according to the class frequencies.

Excerpt of the annotation example (Tab. 7):
pattern | word | description
1 | front | the side that is forward or prominent
1 | ahead | having the leading position or higher score in a contest
1 | forward | the person who plays the position of forward in certain games, such as basketball, soccer, or hockey
1 | preceded | be earlier in time; go back further
1 | Before | earlier in time; previously
1 | before | earlier in time; previously
1 | rear | the back of a military formation or procession
1 | fore | front part of a vessel or aircraft
2 | Name | a language unit by which a person or thing is known
1 | Past | the time that has elapsed
1 | prior | the head of a religious order; in an abbey the prior is next below the abbot
1 | anterior | a tooth situated at the front of the mouth
1 | upperparts | standard terms for unambiguous description of relative placement of body parts
1 | lead | an advantage held by a competitor in a race
1 | backwards | at or to or toward the back or rear
1 | aft | (nautical, aeronautical)
In order to achieve high accuracy, we further calibrate a threshold per classifier for reaching the maximal F1 score for each layer. This calibration is done after training each classifier, over a set of 1000 validation examples.
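The per-classifier calibration step can be sketched as a simple sweep over candidate thresholds, keeping the one that maximizes F1 on the validation set:

```python
import numpy as np

def calibrate_threshold(probs, labels):
    """Return the decision threshold (and its F1) that maximizes F1.

    Sweeps over every distinct predicted probability as a candidate
    threshold, as done once per layer after training the classifier.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):
        pred = probs >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation set: two negatives with low scores, two positives with high.
t, f1 = calibrate_threshold(np.array([0.1, 0.2, 0.8, 0.9]),
                            np.array([0, 0, 1, 1]))
```

In the paper this sweep runs over the 1000 validation examples per layer; the toy arrays above only illustrate the mechanics.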
Frequency of Saturation Events. We investigate the potential of performing early exit for WIKILM and GPT2. Tabs. 9 and 10 depict the frequency of saturation events per layer, considering 10k examples from the WIKITEXT-103 validation set, for WIKILM and GPT2, respectively. In GPT2, 34.15% of the examples require the full computation using all the model layers, while for WIKILM, this holds for only 15.22% of the examples. Notably, early saturation events in GPT2 are less common than in WIKILM, possibly due to the larger number of layers over which the prediction construction is spread. Hence, we use WIKILM for our experiments, as it has significantly higher computation-saving potential, as well as more saturation events per layer.

Table 8: The 10 manually picked value vectors used for toxic language suppression and the top-10 tokens in their projection to the vocabulary. Repetitions in the projections are a result of special characters not being shown. These vectors were found by manually searching for non-toxic words such as "safe" and "peace" in the projections to the vocabulary.