Analyzing Transformers in Embedding Space

Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, we present an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by "translating" the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.


INTRODUCTION
Transformer-based models [Vaswani et al., 2017] currently dominate Natural Language Processing [Devlin et al., 2018; Radford et al., 2019; Zhang et al., 2022] as well as many other fields of machine learning [Dosovitskiy et al., 2020; Chen et al., 2020; Baevski et al., 2020]. Consequently, understanding their inner workings has been a topic of great interest. Typically, work on interpreting Transformers relies on feeding inputs to the model and analyzing the resulting activations [Adi et al., 2016; Shi et al., 2016; Clark et al., 2019]. Thus, interpretation involves an expensive forward, and sometimes also a backward pass, over multiple inputs. Moreover, such interpretation methods are conditioned on the input, and are not guaranteed to generalize to all inputs. In the evolving literature on static interpretation, i.e., without forward or backward passes, Geva et al. [2022b] showed that the value vectors of the Transformer feed-forward module (the second layer of the feed-forward network) can be interpreted by projecting them into the embedding space, i.e., multiplying them by the embedding matrix to obtain a representation over vocabulary items. Elhage et al. [2021] have shown that in a 2-layer attention network, weight matrices can be interpreted in the embedding space as well.
In this work, we extend the theoretical analysis and findings of Elhage et al. [2021] and Geva et al. [2022b], and present a zero-pass framework to understand the behaviour of Transformers. Concretely, we interpret all weights of a pretrained language model (LM) in embedding space, including both keys and values of the feed-forward module as well as all attention parameters.
Our theory relies on a simple observation. Since Geva et al. [2022b] have shown that one can project hidden states to the embedding space via the embedding matrix, we can extend this to other parts of the model by projecting to the embedding space and then projecting back by multiplying with a right-inverse of the embedding matrix. Thus, we can recast inner products in the model as inner products in embedding space. Viewing inner products in this way, we can interpret such products as interactions between pairs of vocabulary items. This applies to (a) interactions between attention queries and keys, as well as to (b) interactions between attention value vectors and the parameters that project them at the output of the attention module. Taking this perspective to an extreme, one can view Transformers as operating implicitly in the embedding space. This entails the existence of a single linear space that depends solely on the tokenizer, in which parameters of different Transformers can be compared. Thus, one can use the embedding space to compare and transfer information across different models that share a tokenizer.

Figure 1: Applications of the embedding space view. Left: interpreting parameters in embedding space. The most active vocabulary items for an example feed-forward key (k) and a feed-forward value (v), and the most active pairs of vocabulary items for an example attention query-key matrix $W_{QK}$ and an attention value-output matrix $W_{VO}$ (see §2). Center: aligning the parameters of different BERT instances that share a vocabulary. Right: zero-shot "stitching", where representations of a fine-tuned classifier are translated through the embedding space (multiplying by $E_A E_B^{-1}$) to a pretrained-only model.
We provide extensive empirical evidence for the credibility of our proposal. On the interpretation front (Fig. 1, Left), we provide qualitative and quantitative evidence that Transformer parameters can be interpreted in embedding space. We also show that when fine-tuning a pretrained LM on a sentiment analysis task (over movie reviews), projecting changes in parameters into embedding space yields words that characterize sentiment towards movies. Second (Fig. 1, Center), we show that given two distinct instances of BERT pretrained with different random seeds [Sellam et al., 2022], we can align the layers of the two instances by casting their weights into the embedding space. We find that layer i of the first instance indeed aligns well to layer i of the second instance, showing that the different BERT instances converge to a semantically similar solution. Last (Fig. 1, Right), we take a model fine-tuned on a sentiment analysis task and "transfer" the learned weights to a different model that was only pretrained, by going through the embedding spaces of the two models. We show that in 30% of the cases, this procedure, termed stitching, results in a classifier that reaches an impressive accuracy of 70% on the IMDB benchmark [Maas et al., 2011] without any training.
Overall, our findings suggest that analyzing Transformers in embedding space is fruitful for both interpretability and as a tool to relate different models that share a vocabulary, and opens the door to interpretation methods that operate in embedding space only. Our code is available at https://github.com/guyd1995/embedding-space.

BACKGROUND
We now present the main components of the Transformer [Vaswani et al., 2017] relevant to our analysis. We discuss the residual stream view of Transformers, and recapitulate a view of the attention layer parameters as interaction matrices $W_{VO}$ and $W_{QK}$ [Elhage et al., 2021]. Similar to Elhage et al. [2021], we exclude biases and layer normalization from our analysis.

TRANSFORMER ARCHITECTURE
The Transformer consists of a stack of layers, each of which includes an attention module followed by a Feed-Forward (FF) module. All inputs and outputs are sequences of $N$ vectors of dimensionality $d$.
The Attention Module takes as input a sequence of representations $X \in \mathbb{R}^{N \times d}$, and each layer is parameterized by four matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{d \times d}$ (we henceforth omit layer superscripts for brevity). The input $X$ is projected to produce queries, keys, and values: $Q_{att} = XW_Q$, $K_{att} = XW_K$, $V_{att} = XW_V$. Each one of $Q_{att}, K_{att}, V_{att}$ is split along the columns into $H$ different heads in $\mathbb{R}^{N \times \frac{d}{H}}$, denoted by $Q^i_{att}, K^i_{att}, V^i_{att}$ respectively. We then compute $H$ attention maps:

$$A^i = \text{softmax}\!\left(\frac{Q^i_{att}\,(K^i_{att})^\top}{\sqrt{d/H}} + M\right),$$

where $M \in \mathbb{R}^{N \times N}$ is the attention mask. Each attention map is applied to the corresponding value head as $A^i V^i_{att}$, results are concatenated along columns and projected via $W_O$. The input to the module is added via a residual connection, and thus the attention module's output is:

$$X + \text{Concat}\!\left[A^1 V^1_{att}, \dots, A^H V^H_{att}\right] W_O. \quad (1)$$

The FF Module is a two-layer neural network, applied to each position independently. Following past terminology [Sukhbaatar et al., 2019; Geva et al., 2020], weights of the first layer are called FF keys and weights of the second layer FF values. This is an analogy to attention, as the FF module too can be expressed as $f(QK^\top)V$, where $f$ is the activation function, $Q \in \mathbb{R}^{N \times d}$ is the output of the attention module (and the input to the FF module), and $K, V \in \mathbb{R}^{d_{ff} \times d}$ are the weights of the first and second layers of the FF module. Unlike attention, keys and values are learnable parameters. The output of the FF module is added to the output of the attention module to form the output of the layer via a residual connection. The output of the $i$-th layer is called the $i$-th hidden state.
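To make the description above concrete, here is a minimal NumPy sketch of the attention module under our simplifying assumptions (no biases or layer norm); the function and variable names are ours, not the model's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V, W_O, H, M=None):
    """Multi-head attention as described above (no biases / layer norm).

    X: (N, d) input; W_Q, W_K, W_V, W_O: (d, d) parameter matrices;
    H: number of heads; M: optional (N, N) additive attention mask.
    """
    N, d = X.shape
    dh = d // H
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(H):
        s = slice(i * dh, (i + 1) * dh)       # column slice for head i
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        if M is not None:
            scores = scores + M
        A = softmax(scores, axis=-1)          # (N, N) attention map of head i
        heads.append(A @ V[:, s])             # (N, d/H)
    out = np.concatenate(heads, axis=1) @ W_O # concat heads, project via W_O
    return X + out                            # residual connection
```

Note that splitting along columns means each head only ever sees its own $d/H$-dimensional slice of the projections, which is what licenses the per-head view used in the rest of the paper.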
Embedding Matrix To process sequences of discrete tokens, Transformers use an embedding matrix E ∈ R d×e that provides a d-dimensional representation to vocabulary items before entering the first Transformer layer. When training Transformers with a language modeling objective, the same embedding matrix E is often used [Press and Wolf, 2016] to take the output of the last Transformer layer and project it back to the vocabulary dimension, i.e., into the embedding space. In this work, we will interpret all components of the Transformer model in the embedding space.

THE RESIDUAL STREAM
We rely on a useful view of the Transformer through its residual connections proposed by Elhage et al. [2021]. Specifically, each layer takes a hidden state as input and adds information to the hidden state through its residual connection. Under this view, the hidden state is a residual stream passed along the layers, from which information is read, and to which information is written at each layer. Elhage et al. [2021] and Geva et al. [2022b] observed that the residual stream is often barely updated in the last layers, and thus the final prediction is determined in early layers and the hidden state is mostly passed through the later layers.
An exciting consequence of the residual stream view is that we can project hidden states in every layer into embedding space by multiplying the hidden state with the embedding matrix E, treating the hidden state as if it were the output of the last layer. Geva et al. [2022a] used this approach to interpret the prediction of Transformer-based language models, and we follow a similar approach.
For each head $i$, define the interaction matrices $W^i_{QK} := W^i_Q (W^i_K)^\top \in \mathbb{R}^{d \times d}$ and $W^i_{VO} := W^i_V W^i_O \in \mathbb{R}^{d \times d}$, where $W^i_Q, W^i_K, W^i_V \in \mathbb{R}^{d \times \frac{d}{H}}$ are the column slices of $W_Q, W_K, W_V$ corresponding to head $i$, and $W^i_O \in \mathbb{R}^{\frac{d}{H} \times d}$ is the corresponding row slice of $W_O$. Importantly, $W^i_{QK}$ and $W^i_{VO}$ are input-independent. Intuitively, $W_{QK}$ encodes the amount of attention between pairs of tokens. Similarly, in $W^i_{VO}$, the matrices $W_V$ and $W_O$ can be viewed as a transition matrix that determines how attending to certain tokens affects the subsequent hidden state. We can restate the attention equations in terms of the interaction matrices. Recall (Eq. 1) that the output of the $i$'th head of the attention module is $A^i V^i_{att}$, and the final output of the attention module is (without the residual connection):

$$\text{Concat}\!\left[A^1 V^1_{att}, \dots, A^H V^H_{att}\right] W_O = \sum_{i=1}^{H} A^i X\, W^i_V W^i_O = \sum_{i=1}^{H} A^i X\, W^i_{VO}. \quad (2)$$

Similarly, the attention map $A^i$ at the $i$'th head in terms of $W_{QK}$ is (softmax is done row-wise):

$$A^i = \text{softmax}\!\left(\frac{X W^i_{QK} X^\top}{\sqrt{d/H}} + M\right). \quad (3)$$
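The claim that attention scores depend on the query and key parameters only through the product $W^i_{QK} = W^i_Q (W^i_K)^\top$ can be checked numerically; a small NumPy sketch with random stand-in weights (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, H = 4, 8, 2
dh = d // H
X = rng.normal(size=(N, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

i = 0                                   # examine head 0
WQ_i = W_Q[:, i * dh:(i + 1) * dh]      # (d, d/H) column slice for head i
WK_i = W_K[:, i * dh:(i + 1) * dh]

W_QK_i = WQ_i @ WK_i.T                  # (d, d) input-independent interaction matrix

# Pre-softmax scores computed the usual way ...
scores_standard = (X @ WQ_i) @ (X @ WK_i).T
# ... equal the scores computed via the interaction matrix (Eq. 3).
scores_interaction = X @ W_QK_i @ X.T
assert np.allclose(scores_standard, scores_interaction)
```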

PROJECTING TRANSFORMER PARAMETERS INTO EMBEDDING SPACE
In this section, we propose that Transformer parameters can be projected into embedding space for interpretation purposes. Our results extend Elhage et al. [2021] who obtained similar results for a two-layer attention-only network. We empirically support our framework in §4- §5.
Given a matrix $A \in \mathbb{R}^{N \times d}$, we can project it into embedding space by multiplying by the embedding matrix $E$ as $\hat{A} = AE \in \mathbb{R}^{N \times e}$. Let $E'$ be a right-inverse of $E$, that is, $EE' = I \in \mathbb{R}^{d \times d}$. Then we can reconstruct the original matrix with $E'$ as $A = A(EE') = \hat{A}E'$. We will use this simple identity to reinterpret the model's operation in embedding space. To simplify our analysis, we ignore layer norms and biases, a standard simplification justified in prior work [Elhage et al., 2021].
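A quick NumPy sanity check of this identity, using the Moore-Penrose pseudo-inverse as the right-inverse $E'$ (synthetic $E$ with vocabulary size $e$ larger than $d$; names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, N = 8, 50, 4
E = rng.normal(size=(d, e))             # synthetic embedding matrix, e > d
E_prime = np.linalg.pinv(E)             # (e, d); a right-inverse since E has full row rank
assert np.allclose(E @ E_prime, np.eye(d))

A = rng.normal(size=(N, d))
A_hat = A @ E                           # project into embedding space
assert np.allclose(A_hat @ E_prime, A)  # reconstruct: A = (AE)E'
```

Any full-row-rank embedding matrix admits such a right-inverse, which is what makes the round trip through the (larger) embedding space lossless.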
In interpretation experiments (§4), we do not use an exact right-inverse such as the Moore-Penrose pseudo-inverse [Moore, 1920; Bjerhammar, 1951; Penrose, 1955], but instead use the transpose of the embedding matrix, $E' = E^\top$. This is because interpretation involves not only projecting using $E'$ but also applying a top-k operation, where we inspect the vocabulary items with the largest logits. We empirically find that the Moore-Penrose pseudo-inverse does not work well for interpretation due to the top-k operation, and provide a justification and comprehensive empirical evidence in Appendix A. Conversely, $E^\top$ empirically works well, and we conjecture this is due to the training procedure of LMs, where $E$ is used to embed discrete tokens into the hidden state dimension and $E^\top$ is used to predict a distribution over the vocabulary items from the last hidden state.
Attention Module. For head $i$, $W^i_{VO} := W^i_V W^i_O$ is the interaction matrix between attention values and the output projection matrix. By definition, the output of each head is:

$$A^i X W^i_{VO} = A^i X (EE') W^i_{VO} (EE') = A^i \hat{X} (E' W^i_{VO} E) E'.$$

Since the output of the attention module is added to the residual stream, we can assume according to the residual stream view that it is meaningful to project it to the embedding space, similar to FF values. Thus, we expect the sequence of $N$ $e$-dimensional vectors $\hat{X}(E' W^i_{VO} E)$ to be interpretable. Importantly, the role of $A^i$ is just to mix the representations of the updated $N$ input vectors. This is similar to the FF module, where FF values (the parameters of the second layer) are projected into embedding space, and FF keys (the parameters of the first layer) determine the coefficients for mixing them. Hence, we can assume that the interpretable components are in the term $\hat{X}(E' W^i_{VO} E)$. Zooming in on this operation, we see that it takes the previous hidden state in the embedding space ($\hat{X}$) and produces an output in the embedding space, which will be incorporated into the next hidden state through the residual stream. Thus, $E' W^i_{VO} E$ is a transition matrix that takes a representation in the embedding space and outputs a new representation in the same space.

Table 1: Projecting Transformer components into embedding space. For each parameter group (Symbol), the table lists its Projection and its Approximate Projection.
Similarly, the matrix $W^i_{QK}$ can be viewed as a bilinear map (Eq. 3). To interpret it in embedding space, we perform the following operation with $E'$:

$$X W^i_{QK} X^\top = X (EE') W^i_{QK} (EE')^\top X^\top = \hat{X} (E' W^i_{QK} E'^\top) \hat{X}^\top.$$

Therefore, the interaction between tokens at different positions is determined by an $e \times e$ matrix that expresses the interaction between pairs of vocabulary items.

FF Module

Geva et al. [2022b] showed that FF value vectors $V \in \mathbb{R}^{d_{ff} \times d}$ are meaningful when projected into embedding space, i.e., for a FF value vector $v \in \mathbb{R}^d$, $vE \in \mathbb{R}^e$ is interpretable (see §2.1). In vectorized form, the rows of $VE \in \mathbb{R}^{d_{ff} \times e}$ are interpretable. On the other hand, the keys $K$ of the FF layer are multiplied on the left by the output of the attention module, which forms the queries of the FF layer. Denoting the output of the attention module by $Q$, we can write this product as $QK^\top = \hat{Q}E'K^\top = \hat{Q}(KE'^\top)^\top$. Because $Q$ is a hidden state, we assume according to the residual stream view that $\hat{Q}$ is interpretable in embedding space. When multiplying $\hat{Q}$ by $KE'^\top$, we are capturing the interaction in embedding space between each query and key, and thus expect $KE'^\top$ to be interpretable in embedding space as well.

Overall, FF keys and values are intimately connected: the $i$-th key controls the coefficient of the $i$-th value, so we expect their interpretation to be related. While not central to this work, we empirically show that key-value pairs in the FF module are similar in embedding space in Appendix B.1.
Subheads Another way to interpret the matrices $W^i_{VO}$ and $W^i_{QK}$ is through the subhead view. We use the identity $AB = \sum_{j=1}^{b} A_{:,j} B_{j,:}$, which holds for arbitrary matrices $A \in \mathbb{R}^{a \times b}$, $B \in \mathbb{R}^{b \times c}$, where $A_{:,j} \in \mathbb{R}^{a \times 1}$ are the columns of the matrix $A$ and $B_{j,:} \in \mathbb{R}^{1 \times c}$ are the rows of the matrix $B$. Thus, we can decompose $W^i_{VO}$ and $W^i_{QK}$ into a sum of $\frac{d}{H}$ rank-1 matrices:

$$W^i_{VO} = \sum_{j=1}^{d/H} (W^i_V)_{:,j} (W^i_O)_{j,:}, \qquad W^i_{QK} = \sum_{j=1}^{d/H} (W^i_Q)_{:,j} \left((W^i_K)_{:,j}\right)^\top.$$

We call these vectors subheads. This view is useful since it allows us to interpret subheads directly by multiplying them with the embedding matrix $E$. Moreover, it shows a parallel between interaction matrices in the attention module and the FF module. Just like the FF module includes key-value pairs as described above, for a given head, its interaction matrices are a sum of interactions between pairs of subheads (indexed by $j$), which are likely to be related in embedding space. We show this is indeed empirically the case for pairs of subheads in Appendix B.1.
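The rank-1 decomposition is easy to verify numerically; a NumPy sketch for $W^i_{VO}$ with synthetic subheads (names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, dh = 8, 2                        # model dim and head dim d/H
W_V_i = rng.normal(size=(d, dh))    # value subheads: columns of W_V^i
W_O_i = rng.normal(size=(dh, d))    # output subheads: rows of W_O^i

W_VO_i = W_V_i @ W_O_i              # (d, d) interaction matrix for head i

# The same matrix as a sum of d/H rank-1 outer products of subhead pairs.
W_VO_sum = sum(np.outer(W_V_i[:, j], W_O_i[j, :]) for j in range(dh))
assert np.allclose(W_VO_i, W_VO_sum)
```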
We summarize our approach for projecting the different components of the Transformer into embedding space in Table 1.

INTERPRETABILITY EXPERIMENTS
In this section, we provide empirical evidence for the viability of our approach as a tool for interpreting Transformer parameters.

PARAMETER INTERPRETATION EXAMPLES
We take GPT-2 medium [Radford et al., 2019] and manually analyze its parameters. GPT-2 medium has a total of 384 attention heads (24 layers and 16 heads per layer). We take the embedded transition matrices $E^\top W^i_{VO} E$ for all heads and examine the top-k pairs of vocabulary items. As there are only 384 heads, we manually choose a few heads and present the top-k pairs in Appendix C.1 (k = 50). We observe that different heads capture different types of relations between pairs of vocabulary items, including word parts, gender, geography, orthography, particular part-of-speech tags, and various semantic topics. In Appendix C.2 we perform a similar analysis for $W_{QK}$. Appendix C.3 provides examples of key-value pairs from the FF modules of GPT-2 medium. We show random pairs (k, v) from the set of those pairs such that when looking at the top-100 vocabulary items for k and v, at least 15% overlap. Such pairs account for approximately 5% of all key-value pairs. The examples show how key-value pairs often revolve around similar topics such as media, months, organs, etc.
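As an illustration, the following sketch extracts the top-k pairs of vocabulary items from an embedded transition matrix $E^\top W_{VO} E$. Here `E`, `W_VO`, and `vocab` are synthetic stand-ins (in practice they would come from GPT-2's weights and tokenizer), and `top_pairs` is a helper name of ours.

```python
import numpy as np

def top_pairs(E, W_VO, vocab, k=10):
    """Top-k (input token, output token) pairs of the embedded transition
    matrix E^T @ W_VO @ E, i.e. using E' = E^T as in Section 3."""
    M = E.T @ W_VO @ E                           # (e, e): a score per vocabulary pair
    flat = np.argsort(M, axis=None)[::-1][:k]    # flat indices of the k largest entries
    rows, cols = np.unravel_index(flat, M.shape)
    return [(vocab[r], vocab[c], M[r, c]) for r, c in zip(rows, cols)]
```

For GPT-2 medium one would load $E$ and the per-head $W^i_{VO}$ from the checkpoint and run this for each of the 384 heads.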
Last, we show we can use embeddings to locate FF values (or keys) related to a particular topic. We take a few vocabulary items related to a certain topic, e.g., ['cm', 'kg', 'inches'], average their embeddings, and rank all FF values (or keys) based on their dot-product with the average. Appendix C.4 shows a few examples of FF values found with this method that are related to programming, measurements, and animals.
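A sketch of this topic-directed search; again `E`, `V`, and `vocab` are stand-ins for a real model's embedding matrix, FF values, and vocabulary, and the function name is ours.

```python
import numpy as np

def rank_ff_values_by_topic(E, V, vocab, topic_words, top_n=5):
    """Rank FF value vectors by dot product with the average embedding of a
    few topic words (e.g. ['cm', 'kg', 'inches']); returns the indices of
    the top_n best-matching FF values."""
    idx = [vocab.index(w) for w in topic_words]
    topic_vec = E[:, idx].mean(axis=1)   # (d,) average embedding of the topic words
    scores = V @ topic_vec               # (d_ff,) one score per FF value
    return np.argsort(scores)[::-1][:top_n]
```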

HIDDEN STATE AND PARAMETERS
An advantage of zero-pass interpretation is that it does not require running inputs through the model which is expensive and non-exhaustive. In this section (and this section only), we run a forward pass over inputs and examine if the representations in embedding space of dynamically-computed hidden states are "similar" to the representations of static parameter vectors that are activated.
A technical side note: we use GPT-2, which applies layer norm to the Transformer output before projecting it to the embedding space with E. Thus, conservatively, layer norm should be considered as part of the projection operation. Empirically however, we observe that projecting parameters directly without layer norm works well, which simplifies our analysis in §3. An exception is when projecting hidden states in this section, where we apply layer norm before projection to improve performance, similar to Geva et al. [2022a].
Experimental Design We use GPT-2 medium and run it over 60 examples from IMDB [Maas et al., 2011]. This provides us with a dynamically-computed hidden state $h$ for every token and at the output of every layer. For the projection $\hat{h} \in \mathbb{R}^e$ of each such hidden state, we take the projections of the $m$ most active parameter vectors $\{x_i\}_{i=1}^m$ in the layer that computed $h$ and check if they cover the dominant vocabulary items of $\hat{h}$ in embedding space. Specifically, let top-$k(wE)$ be the $k$ vocabulary items with largest logits in embedding space for a vector $w \in \mathbb{R}^d$. We compute:

$$R_k(x_1, \dots, x_m, h) = \frac{1}{k} \left| \text{top-}k(hE) \cap \bigcup_{i=1}^{m} \text{top-}k(x_i E) \right|,$$

to capture whether activated parameter vectors cover the main vocabulary items corresponding to the hidden state.
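One plausible implementation of the $R_k$ score as described above, sketched in NumPy (function names ours):

```python
import numpy as np

def top_k_tokens(w, E, k):
    """Set of the k vocabulary items with the largest logits for vector w."""
    return set(np.argsort(w @ E)[::-1][:k])

def R_k(h, xs, E, k):
    """Fraction of the hidden state's top-k vocabulary items covered by the
    union of top-k items of the activated parameter vectors xs."""
    covered = set().union(*(top_k_tokens(x, E, k) for x in xs))
    return len(top_k_tokens(h, E, k) & covered) / k
```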
We find the $m$ most active parameter vectors separately for FF keys ($K$), FF values ($V$), attention value subheads ($W_V$) (see §3), and attention output subheads ($W_O$), where the activation of each parameter vector is determined by the vector's "coefficient" (e.g., for a FF key-value pair, the coefficient of the value is the activation of its key; see §2.1). Figure 2 presents the $R_k$ score averaged across tokens per layer. As a baseline, we compare the $R_k$ of the activated vectors $\{x_i\}_{i=1}^m$ with the correctly-aligned hidden state $\hat{h}$ at the output of the relevant layer (blue bars) against the $R_k$ when randomly sampling $\hat{h}_{rand}$ from the set of all hidden states (orange bars). We conclude that the representations in embedding space induced by activated parameter vectors mirror, at least to some extent, the representations of the hidden states themselves. Appendix B.2 shows a variant of this experiment, where we compare activated parameters throughout GPT-2 medium's layers to the last hidden state, which produces the logits used for prediction.

INTERPRETATION OF FINE-TUNED MODELS
We now show that we can interpret the changes a model goes through during fine-tuning through the lens of embedding space. We fine-tune the top-3 layers of the 12-layer GPT-2-base with a sequence classification head on IMDB sentiment analysis (binary classification) and compute the difference between the original parameters and the fine-tuned model. We then project the difference of parameter vectors into embedding space and test whether the change is interpretable with respect to sentiment analysis.
Appendix D shows examples for projected differences randomly sampled from the fine-tuned layers. Frequently, the difference, or its negation, is projected to nouns, adjectives and adverbs that express sentiment for a movie, such as 'amazing', 'masterpiece', 'incompetence', etc. This shows that the differences are indeed projected into vocabulary items that characterize movie reviews' sentiment. Almost all parameter groups present this behavior, except for V and W O , which curiously are the parameters added to the residual stream.

ALIGNING MODELS IN EMBEDDING SPACE
Assuming Transformers by and large operate in embedding space leads to an exciting possibility: we can relate different models to one another so long as they share a vocabulary and tokenizer. In §5.1, we show that we can align the layers of BERT models trained with different random seeds. In §5.2, we show the embedding space can be leveraged to "stitch" the parameters of a fine-tuned model to a model that was not fine-tuned.

Figure 3: Left: Aligning in embedding space the layers of two different BERT models initialized from different random seeds, for all parameter groups. Layers that have the same index tend to align with one another. Right: Alignment in feature space leads to unintelligible patterns.

LAYER ALIGNMENT
Experimental Design Taking our approach to the extreme, the embedding space is a universal space, which depends only on the tokenizer, and in which Transformer parameters and hidden states reside. Consequently, we can align parameter vectors from different models in this space and compare them even if they come from different models, as long as they share a vocabulary.
To demonstrate this, we use MultiBERT [Sellam et al., 2022], which contains 25 different instantiations of BERT initialized from different random seeds. We take parameters from two MultiBERT seeds and compute the Pearson correlation between their projections to embedding space. For example, let $V_A$, $V_B$ be the FF values of models $A$ and $B$. We can project the values into embedding space as $\hat{V}_A = V_A E_A$ and $\hat{V}_B = V_B E_B$, where $E_A$, $E_B$ are the respective embedding matrices, and compute the Pearson correlation between projected values. This produces a similarity matrix $\tilde{S} \in \mathbb{R}^{|V_A| \times |V_B|}$, where each entry is the correlation coefficient between projected values from the two models. We bin $\tilde{S}$ by layer pairs and average the absolute value of the scores in each bin (different models might encode the same information in different directions, so we use absolute values) to produce a matrix $S \in \mathbb{R}^{L \times L}$, where $L$ is the number of layers. Specifically, the average (absolute) correlation between vectors that come from layer $\ell_A$ in model $A$ and layer $\ell_B$ in model $B$ is registered in entry $(\ell_A, \ell_B)$ of $S$.
Last, to obtain a one-to-one layer alignment, we use the Hungarian algorithm [Kuhn, 1955], which assigns exactly one layer from the first model to a layer from the second model. The algorithm's objective is to maximize, given a similarity matrix $S$, the sum of similarities of the chosen pairs, such that each index in one model is matched with exactly one index in the other. We repeat this for all parameter groups ($W_Q$, $W_K$, $W_V$, $W_O$, $K$). Figure 3 (left) shows the resulting alignment. Clearly, parameters from a certain layer in model $A$ tend to align to the same layer in model $B$ across all parameter groups. This suggests that different layers from different models that were trained separately (but with the same training objective and data) serve a similar function. As further evidence, we show that if not projected, the matching appears essentially random, in Figure 3 (right). We show the same results for other seed pairs as well in Appendix B.3.
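The matching step can be sketched with SciPy's implementation of the Hungarian algorithm (`linear_sum_assignment`); `align_layers` is our name, and `S` stands in for the binned absolute-correlation matrix described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_layers(S):
    """One-to-one layer matching that maximizes total similarity, given the
    (L, L) matrix S of average |Pearson correlation| per layer pair."""
    rows, cols = linear_sum_assignment(-S)  # negate: the solver minimizes cost
    return list(zip(rows, cols))
```

Negating $S$ turns the solver's cost minimization into the similarity maximization described above.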

ZERO-SHOT STITCHING
Model stitching [Lenc and Vedaldi, 2015; Csiszárik et al., 2021; Bansal et al., 2021] is a relatively under-explored property of neural networks, particularly in NLP. The idea is that different models, sometimes trained on different data and with different architectures, learn representations that can be aligned through a linear transformation, termed stitching. Representations correspond to hidden states, and thus one can learn a transformation matrix from one model's hidden states to the equivalent hidden states in the other model. Here, we show that by going through embedding space one can align the hidden states of two models, i.e., stitch, without training.
Given two models, we want to find a linear stitching transformation to align their representation spaces. According to our theory, given a hidden state $v \in \mathbb{R}^{d_1}$ from model $A$, we can project it to the embedding space as $vE_A$, where $E_A$ is its embedding matrix. Then, we can re-project to the feature space of model $B$ with $E_B^+ \in \mathbb{R}^{e \times d_2}$, where $E_B^+$ is the Moore-Penrose pseudo-inverse of the embedding matrix $E_B$. This transformation can be expressed as multiplication with the kernel $K_{AB} = E_A E_B^+$. We employ the above approach to take representations of a fine-tuned classifier, $A$, and stitch them on top of a model $B$ that was only pretrained, to obtain a new classifier based on $B$.

Figure 4: Accuracy on the IMDB evaluation set. We ran stitching randomly 11 times and obtained 3 models with higher-than-random accuracy when stitching over top layers. The dashed red line indicates random performance.
Experimental Design We use the 24-layer GPT-2 medium as model $A$ and the 12-layer GPT-2 base model trained in §4.3 as model $B$. We fine-tune the last three layers of model $B$ on IMDB, as explained in §4.3. Stitching is simple and is performed as follows. Given the sequence of $N$ hidden states $H_A \in \mathbb{R}^{N \times d_1}$ at the output of layer $\ell$ of model $A$ ($\ell$ is a hyperparameter), we apply the stitching layer, which multiplies the hidden states with the kernel, computing $H_A K_{AB}$. This results in hidden states $H_B \in \mathbb{R}^{N \times d_2}$, used as input to the three fine-tuned layers from $B$.
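The stitching layer itself is a single matrix multiplication; a NumPy sketch with synthetic embedding matrices standing in for those of the two GPT-2 models (function names ours):

```python
import numpy as np

def stitching_kernel(E_A, E_B):
    """K_AB = E_A @ pinv(E_B): maps hidden states of model A (dim d1) into
    model B's feature space (dim d2) through the shared embedding space."""
    return E_A @ np.linalg.pinv(E_B)     # (d1, e) @ (e, d2) -> (d1, d2)

def stitch(H_A, E_A, E_B):
    """Apply the stitching layer to a sequence of hidden states H_A (N, d1)."""
    return H_A @ stitching_kernel(E_A, E_B)
```

A useful sanity check: stitching a model to itself yields the identity, since $E E^+ = I$ for a full-row-rank embedding matrix.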

Results and Discussion
Stitching produces models with accuracies that are higher than random on the IMDB evaluation set, but not consistently. Figure 4 shows the accuracy of stitched models against the layer index from model A over which stitching is performed. Out of 11 random seeds, three models obtained accuracy significantly higher than the 50% baseline, reaching roughly 70% when stitching is done over the top layers.

RELATED WORK
Interpreting Transformers is a broad area of research that has attracted much attention in recent years. A large body of work has focused on analyzing hidden representations, mostly through probing [Adi et al., 2016; Shi et al., 2016; Tenney et al., 2019; Rogers et al., 2020]. Voita et al. [2019a] used statistical tools to analyze the evolution of hidden representations throughout layers. Recently, Mickus et al. [2022] proposed to decompose the hidden representations into the contributions of different Transformer components. Unlike these works, we interpret parameters rather than the hidden representations.
Our work is most related to efforts to interpret specific groups of Transformer parameters. Cammarata et al. [2020] made observations about the interpretability of weights of neural networks. Elhage et al. [2021] analyzed 2-layer attention networks. We extend their analysis to multi-layer pretrained Transformer models. Geva et al. [2020; 2022a; b] interpreted feed-forward values in embedding space. We coalesce these lines of work and offer a unified interpretation framework for Transformers in embedding space.

DISCUSSION
Our work has a few limitations that we care to highlight. First, it focuses on interpreting models through the vocabulary lens. While we have shown evidence for this, it does not preclude other factors from being involved in the computation process. Second, we used $E' = E^\top$, but future research might find variants of $E'$ that improve performance. Last, we assume Transformer components can be projected to the embedding space with a single matrix multiplication, but this might depend on model training, e.g., in GPT-2 it involves a layer norm operation, as explained in §4.2.
Notwithstanding, we believe the benefits of our work overshadow its limitations. We provide a simple and efficient approach, which equips researchers with new tools to interpret Transformer models and relate them to one another. Apart from Elhage et al. [2021], there has been little work pursuing the embedding space approach, and we "sharpen" the tools they laid down and adjust them to existing pre-trained Transformers. Moreover, our framework allows us to view parameters from different models as residents of the same universal embedding space, where they can be compared in model-agnostic fashion. We demonstrate two applications of this observation (model alignment and stitching) and argue future work can yield many additional applications.

A RETHINKING INTERPRETATION
The process of interpreting a vector $v$ in Geva et al. [2022b] proceeds in two steps: first, the vector is projected to the embedding space ($vE$); then, the list of tokens that were assigned the largest values in the projected vector, i.e., top-$k(vE)$, is used as the interpretation of the projected vector. This is reasonable since (a) the most activated coordinates contribute the most when added to the residual stream, and (b) this matches how we eventually decode: we project to the embedding space and consider the top-1 token (or one of the few top tokens, when using beam search).
In this work, we interpret inner products and matrix multiplications in the embedding space: given two vectors $x, y \in \mathbb{R}^d$, their inner product $x^\top y$ can be considered in the embedding space by multiplying with $E$ and then by one of its right-inverses (e.g., its pseudo-inverse $E^+$ [Moore, 1920; Bjerhammar, 1951; Penrose, 1955]):

$$x^\top y = x^\top (EE^+) y = (xE)\,(yE^{+\top})^\top.$$
Assume $xE$ is interpretable in the embedding space, crudely meaning that it represents logits over vocabulary items. We expect $y$, which interacts with $x$, to also be interpretable in the embedding space. Consequently, we would like to take $yE^{+\top}$ to be the projection of $y$. However, this projection does not take into account the subsequent interpretation using top-$k$. The projected vector $yE^{+\top}$ might be harder to interpret in terms of its most activated tokens. To alleviate this problem, we need a different "inverse" matrix $E'$ that works well when considering the top-$k$ operation. Formally, we want an $E'$ with the following "robustness" guarantee: keep-$k(xE)^\top$ keep-$k(yE') \approx x^\top y$, where keep-$k(v)$ is equal to $v$ for coordinates whose absolute value is among the top-$k$, and zero elsewhere.
This is a stronger notion of inverse: not only is $EE' \approx I$, but even when truncating the vector in the embedding space we can still reconstruct it with $E'$.
We claim that $E^\top$ is a decent instantiation of $E'$ and provide some empirical evidence. While a substantive line of work [Ethayarajh, 2019; Gao et al., 2019; Wang et al., 2020; Rudman et al., 2021] has shown that embedding matrices are not isotropic (an isotropic matrix $E$ has to satisfy $EE^\top = \alpha I$ for some scalar $\alpha$), we show that the embedding matrix is isotropic enough to make $E^\top$ a legitimate compromise. We randomly sample 300 vectors drawn from the normal distribution $\mathcal{N}(0, 1)$, and compute for every pair $x, y$ the cosine similarity between $x^\top y$ and keep-$k(xE)^\top$ keep-$k(yE')$ for $k = 1000$, and then average over all pairs. We repeat this for $E' \in \{E^{+\top}, E^\top\}$ and obtain a score of 0.10 for $E^{+\top}$ and 0.83 for $E^\top$, showing that $E^\top$ is better when using top-$k$. More globally, we compare $E' \in \{E^{+\top}, E^\top\}$ for $k \in \{10, 50, 100, 200, 300, 500\}$ with three distributions:
- $x, y$ drawn from the normal distribution $\mathcal{N}(0, 1)$;
- $x, y$ chosen randomly from the FF values;
- $x, y$ drawn from hidden states along Transformer computations.
In Figure 5 (Left) we show the results, where dashed lines represent $E^{+\top}$ and solid lines represent $E^\top$. For small values of $k$ (used for interpretation), $E^\top$ is superior to $E^{+\top}$ across all distributions. Interestingly, the hidden state distribution is the only distribution where $E^{+\top}$ has performance similar to $E^\top$. Curiously, when looking at higher values of $k$ ($k \in \{512, 1024, 2048, 4096, 10000, 15000, 20000, 30000\}$), the trend is reversed; see Figure 5 (Right).
This settles the deviation from findings showing embedding matrices are not isotropic: indeed, as $k$ grows, $E^\top$ becomes an increasingly bad approximate right-inverse of the embedding matrix. The only distribution that keeps high performance with $E^\top$ is the hidden state distribution, which is an interesting direction for future investigation.

B.1 SIMILARITY OF CORRESPONDING PARAMETER PAIRS

We define the following metric, applied to vectors after projecting them into the embedding space:

$$\text{sim}_k(x, y) = \frac{|\text{top-}k(xE) \cap \text{top-}k(yE)|}{|\text{top-}k(xE) \cup \text{top-}k(yE)|},$$

where top-$k(v)$ is the set of $k$ most activated indices in the vector $v$ (which correspond to tokens in the embedding space). This metric is the Jaccard index [Jaccard, 1912] applied to the top-$k$ tokens from each vector. In Figure 6, Left, we demonstrate that corresponding FF key and value vectors are more similar (in embedding space) than two random key and value vectors. In Figure 6, Right, we show a similar result for attention value and output vectors. In Figure 6, Bottom, the same analysis is done for attention query and key vectors. This shows that there is a much higher-than-chance relation between corresponding FF keys and values (and the same for attention values and outputs).
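A sketch of this Jaccard metric over top-$k$ token sets (function name ours):

```python
import numpy as np

def jaccard_top_k(x, y, E, k):
    """Jaccard index of the top-k token sets of x and y in embedding space."""
    top_x = set(np.argsort(x @ E)[::-1][:k])
    top_y = set(np.argsort(y @ E)[::-1][:k])
    return len(top_x & top_y) / len(top_x | top_y)
```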

B.2 FINAL PREDICTION AND PARAMETERS
We show that the final prediction of the model is correlated in embedding space with the most activated parameters from each layer. This implies that these objects are germane to the analysis of the final prediction in the embedding space, which in turn suggests that the embedding space is a viable choice for interpreting these vectors. Figure 7 shows that, just as in §4.2, correspondence is better when hidden states are not randomized, suggesting these parameter interpretations have an impact on the final prediction.

CLASSIFICATION HEAD PARAMETERS
Below we show the fine-tuning difference vector of the classifier weights. "POSITIVE" designates the vector corresponding to the label "POSITIVE", and similarly for "NEGATIVE". In the following subsections, we sample 4 difference vectors for each parameter group (FF keys, FF values; attention query, key, value, and output subheads) and each of the fine-tuned layers (layers 9-11). We present the ones that seemed to contain relevant patterns upon manual inspection. We also report the number of "good" vectors among the four sampled vectors for each layer and parameter group.