All Bark and No Bite: Rogue Dimensions in Transformer Language Models Obscure Representational Quality

Similarity measures are a vital tool for understanding how language models represent and process language. Standard representational similarity measures such as cosine similarity and Euclidean distance have been successfully used in static word embedding models to understand how words cluster in semantic space. Recently, these measures have been applied to embeddings from contextualized models such as BERT and GPT-2. In this work, we call into question the informativity of such measures for contextualized language models. We find that a small number of rogue dimensions, often just 1-3, dominate these measures. Moreover, we find a striking mismatch between the dimensions that dominate similarity measures and those which are important to the behavior of the model. We show that simple postprocessing techniques such as standardization are able to correct for rogue dimensions and reveal underlying representational quality. We argue that accounting for rogue dimensions is essential for any similarity-based analysis of contextual language models.


Introduction
By mapping words into continuous vector spaces, we can reason about human language in geometric terms. For example, the cosine similarity of pairs of word embeddings in Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) shows a robust correlation with human similarity judgments, and embeddings cluster into natural semantic classes in Euclidean space (Baroni et al., 2014; Wang et al., 2019). In recent years, static embeddings have given way to their contextual counterparts, with language models based on the transformer architecture (Vaswani et al., 2017) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2020), XLNet (Yang et al., 2019) and GPT-2 (Radford et al., 2019) achieving state-of-the-art results on many language understanding tasks. Despite their success, relatively little is known about how these models represent and process language. Recent work has applied measures such as cosine similarity and Euclidean distance to contextual representations, with unclear and counterintuitive results. For example, similarity/distance measures in BERT are extremely sensitive to word position, leading to inconsistent results on evaluation benchmarks (Mickus et al., 2020; May et al., 2019). Additionally, representational quality appears to degrade severely in later layers of each network, with the final layers of BERT, RoBERTa, GPT-2 and XLNet showing little to no correlation with the semantic similarity/relatedness judgments of humans (Bommasani et al., 2020).
Recent work which probes the representational geometry of contextualized embedding spaces using cosine similarity has found that contextual embeddings have several counterintuitive properties (Ethayarajh, 2019). For example: 1) Word representations are highly anisotropic: randomly sampled words tend to be highly similar to one another when measured by cosine similarity. In the final layer of GPT-2, for example, any two words are almost perfectly similar. 2) Embeddings have extremely low self-similarity: in later layers of transformer-based language models, random words are almost as similar to one another as instances of the same word in different contexts.
In this work, we critically examine the informativity of standard similarity/distance measures (particularly cosine similarity and Euclidean distance) in contextual embedding spaces. We find that these measures are often dominated by 1-5 dimensions across all the contextual language models we tested, regardless of the specific pretraining objective. It is this small subset of dimensions which drives anisotropy, low self-similarity, and the apparent drop in representational quality in later layers. These dimensions, which we refer to as rogue dimensions, are centered far from the origin and have disproportionately high variance. The presence of rogue dimensions can cause cosine similarity and Euclidean distance to rely on less than 1% of the embedding space. Moreover, we find that the rogue dimensions which dominate cosine similarity do not likewise dominate model behavior; instead, they show a strong correlation with absolute position and punctuation.
Finally, we show that these dimensions can be accounted for using a trivially simple transformation of the embedding space: standardization. Once applied, cosine similarity more closely reflects human word similarity judgments, and we see that representational quality is preserved across all layers rather than degrading/becoming task-specific. Taken together, we argue that accounting for rogue dimensions is essential when evaluating representational similarity in transformer language models.
Related Work

However, a number of works have questioned the appropriateness of cosine similarity. Schnabel et al. (2015) found that static embedding models encode a substantial degree of word frequency information, which leads to a frequency bias in cosine similarity. May et al. (2019) questioned the adequacy of cosine similarity in sentence encoders after finding contextual discrepancies in bias measures. Perhaps most relevant to the present work is Zhelezniak et al. (2019), which treats individual word embeddings as statistical samples, shows the equivalence of cosine similarity and Pearson correlation, and notes that Pearson correlation (and therefore cosine similarity) is highly sensitive to outlier dimensions. They further suggest the use of non-parametric rank correlation measures such as Spearman's ρ, which is robust to outliers. Our work investigates the sensitivity of cosine similarity to outlier dimensions in contextual models, and further characterizes the behavioral correlates of these outliers.
Our goal in this work was not causal explanation of degenerate embedding spaces or post-processing for task performance gains, but rather to empirically motivate trivially simple transformations to enable effective interpretability research with existing metrics. However, we refer interested readers to Gao et al. (2019) who studied degeneration toward anisotropy in machine translation. Similarly, Li et al. (2020) suggested a learned transformation of transformer embedding spaces which resulted in increased performance on semantic textual similarity tasks.

Rogue Dimensions and Representational Geometry

Anisotropy
In this section, we investigate how each dimension of the embedding space contributes to anisotropy, defined by Ethayarajh (2019) as the expected cosine similarity of randomly sampled token pairs. They showed that contextual embedding spaces are highly anisotropic, meaning that the contextual representations of any two tokens are expected to be highly similar to one another. We investigate this counterintuitive property by decomposing the cosine similarity computation by dimension, and show that the cosine similarity of any two tokens is dominated by a small subset of rogue dimensions. We conclude that anisotropy is not a global property of the entire embedding space, but is instead driven by a small number of idiosyncratic dimensions.

Setup
Ethayarajh (2019) defines the anisotropy in layer ℓ of model f as the expected cosine similarity of any pair of words in a corpus. This can be approximated as Â(f) from a sample S of n random token pairs from a corpus O, S = {{x_1, y_1}, ..., {x_n, y_n}} ∼ O:

$$\hat{A}(f) = \frac{1}{n} \sum_{\{x, y\} \in S} \cos(f(x), f(y))$$

The cosine similarity between two vectors u and v of dimensionality d is defined as

$$\cos(u, v) = \frac{u \cdot v}{\|u\| \|v\|} = \sum_{i=1}^{d} \frac{u_i v_i}{\|u\| \|v\|}$$

Expressing cosine similarity as a summation over d dimensions, we can define a function CC_i(u, v) which gives the contribution of dimension i to the total cosine similarity of u and v:

$$CC_i(u, v) = \frac{u_i v_i}{\|u\| \|v\|}$$

From this, we define CC(f_i), the contribution of dimension i to Â(f):

$$CC(f_i) = \frac{1}{n} \sum_{\{x, y\} \in S} CC_i(f(x), f(y))$$

From the mean cosine contribution by dimension, we can determine how much each dimension contributes to the total anisotropy. If CC(f_1) ≈ CC(f_2) ≈ ... ≈ CC(f_d), then we conclude that anisotropy is a global property of the embedding space; no one dimension drives the expected cosine similarity of any two embeddings. By contrast, if CC(f_i) ≫ ∑_{j ≠ i} CC(f_j), then we conclude that dimension i dominates the cosine similarity computation.
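The decomposition above can be sketched in a few lines of NumPy. This is a toy illustration under our own assumptions (synthetic embeddings, small dimensionality, and one planted rogue dimension; nothing here comes from a real model):

```python
import numpy as np

def cosine_contributions(u, v):
    """Per-dimension contributions CC_i(u, v); they sum to cos(u, v)."""
    return (u * v) / (np.linalg.norm(u) * np.linalg.norm(v))

def mean_contributions(pairs):
    """CC(f_i): mean contribution of each dimension to the expected
    cosine similarity over a sample of token-embedding pairs."""
    return np.mean([cosine_contributions(x, y) for x, y in pairs], axis=0)

rng = np.random.default_rng(0)
# Toy embeddings with one planted "rogue" dimension: a large shared offset.
emb = rng.normal(size=(1000, 8))
emb[:, 0] += 20.0
pairs = [(emb[i], emb[j]) for i, j in rng.integers(0, 1000, size=(500, 2))]

cc = mean_contributions(pairs)   # per-dimension CC(f_i)
a_hat = cc.sum()                 # anisotropy estimate: sums to Â(f)
print(cc[0] / a_hat)             # the planted rogue dimension dominates Â(f)
```

On this toy data, a single dimension accounts for nearly all of the expected cosine similarity, mirroring the pattern reported in Table 1 for the contextual models.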

Experiment
We compute the average cosine similarity contribution, CC(f_i), for each dimension in all layers of BERT, RoBERTa, GPT-2, and XLNet. We then normalize by the total expected cosine similarity Â(f) to get the proportion of the total expected cosine similarity contributed by each dimension. All models are of dimensionality d = 768 and have 12 layers, plus one static embedding layer. We also include two 300-dimensional non-contextual models, Word2Vec and GloVe, for comparison. Our corpus O is an 85k-token sample of random articles from English Wikipedia. All input sequences consisted of 128 tokens. From the resulting representations we take a random sample S of 500k token pairs. For each model, we report the three dimensions with the largest cosine contributions in the two most anisotropic layers, as well as the overall anisotropy Â(f).

Table 1: Proportion of total expected cosine similarity, CC(f_i)/Â(f), contributed by each of the top 3 dimensions in the two most anisotropic layers of each model, along with the anisotropy estimate Â(f) for the given layer. Results for all layers can be found in Table 4 of the appendix.

Results and Discussion
Results are summarized in Table 1. The static models Word2Vec and GloVe are relatively isotropic and are not dominated by any single dimension. Across all transformer models tested, a small subset of rogue dimensions dominate the cosine similarity computation, especially in the more anisotropic final layers. Perhaps the most striking case is layers 10 and 11 of XLNet, where a single dimension contributes more than 99% of the expected cosine similarity between randomly sampled tokens.
The dimensions which drive anisotropy are centered far from the origin relative to other dimensions. One implication of anisotropy is that the embeddings occupy a narrow cone in the embedding space, as the angle between any two word embeddings is very small. However, if anisotropy is driven by a single dimension (or a small subset of dimensions), we can conclude that the cone lies along a single axis or within a low dimensional subspace, rather than being a global property across all dimensions. 5 We conclude from this analysis that the anisotropy of the embedding space is an artifact of cosine similarity's high sensitivity to a small set of outlier dimensions and is not a global property of the space. 6

Informativity of Similarity Measures
In the previous section, we found that anisotropy is driven by a small subset of dimensions. In this section, we investigate whether standard similarity measures are still informed by the entire embedding space, or if variability in the measure is also driven by a small subset of dimensions.
For example, it could be the case that some dimension i has a large but roughly constant activation across all tokens, meaning E[CC(f_i)] will be large but Var[CC(f_i)] will be near zero. In that case, dimension i adds a large constant to cosine similarity, making Â(f) large without changing Var[cos(f(x), f(y))]: the average cosine similarity is driven toward 1.0 by dimension i, but any changes in cosine similarity are driven by the rest of the embedding space, so the measure still provides information about the entire representation space rather than a single dimension. Conversely, dimension i may have mean activation near zero but extremely large variance across tokens. In this case, dimension i would not appear to make the space anisotropic, but would still drive variability in cosine similarity. Ultimately, we are not interested in where the representation space is centered, but in whether changes in a similarity measure reflect changes in the entire embedding space.
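The two scenarios can be made concrete with a small simulation. Both are our own toy constructions (synthetic 8-dimensional embeddings), intended only to separate a dimension's mean from its variance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 8
base = rng.normal(size=(n, d))

def pair_cosines(emb, n_pairs=1000):
    """Cosine similarities of randomly sampled row pairs."""
    idx = rng.integers(0, len(emb), size=(n_pairs, 2))
    a, b = emb[idx[:, 0]], emb[idx[:, 1]]
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Scenario 1: dimension 0 has a huge constant activation. E[cos] is driven
# toward 1, but the measure's variability still comes from other dimensions.
const = base.copy()
const[:, 0] = 30.0

# Scenario 2: dimension 0 has mean zero but huge variance. E[cos] stays
# small, yet dimension 0 now drives nearly all variability in cosine.
wild = base.copy()
wild[:, 0] = rng.normal(scale=30.0, size=n)

print(pair_cosines(const).mean())  # close to 1: looks anisotropic
print(pair_cosines(wild).mean())   # near 0: looks isotropic
print(pair_cosines(wild).std())    # but the variability is enormous
```

The second scenario is the dangerous one for interpretability work: the space looks isotropic on average, while one dimension silently determines which pairs count as similar.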
In this section we uncover which dimensions drive the variability of cosine similarity. 7 Paralleling our findings in Section 3.1 we find that the token pairs which are similar/dissimilar to one another completely change when we remove just 1-5 dominant dimensions from the embedding space.

Setup
Let f(x) : X → R^d be the function which maps a token x to its representation in layer ℓ of model f.

5 Our analysis complements that of Cai et al. (2021), which used Principal Component Analysis to identify isolated isotropic clusters as well as embedding cones in a space reduced to three dimensions.
6 We additionally replicate Ethayarajh (2019) before and after removing rogue dimensions in Appendix A, and show that their analyses are extremely sensitive to rogue dimensions.
7 We conduct the same analysis using Euclidean distance in Appendix B and reach similar conclusions as with cosine similarity.

Table 2: Proportion of variance in cosine similarity, r², explained by cosine similarity when the top k dimensions, measured by CC(f_i), are removed. Layer 0 is the static embedding layer. Results for all layers can be found in Table 5 of the Appendix.
Let f'(x) : X → R^{d−k} be the function which maps token x to its representation with the top k dimensions (measured by contribution to cosine similarity) removed. Let C(S) be the vector of cosine similarities cos(f(x), f(y)) over all pairs {x, y} ∈ S, and let C'(S) be the corresponding vector computed from f'. In this analysis, we compute:

$$r = \mathrm{Corr}[C(S), C'(S)]$$

This is the Pearson correlation between the cosine similarities in the entire embedding space and those similarities when k dimensions are removed. In our analysis we report r², which corresponds to the proportion of variance in C(S) explained by C'(S). For example, if we set k = 1 and the observed r² is large, then cosine similarity in the full embedding space is still well explained by the remaining d − 1 dimensions. By contrast, if r² is small, then the variance of cosine similarity in the embedding space cannot be well explained by the bottom d − 1 dimensions, and thus a single dimension drives variability in cosine similarity.
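This r² analysis can be sketched on synthetic embeddings with one planted high-contribution dimension (all sizes, distributions, and the choice of which dimension to plant are illustrative assumptions, not model measurements):

```python
import numpy as np

def pairwise_cos(emb, idx):
    """Cosine similarity for each sampled pair of rows."""
    a, b = emb[idx[:, 0]], emb[idx[:, 1]]
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

rng = np.random.default_rng(2)
emb = rng.normal(size=(2000, 8))
emb[:, 0] = rng.normal(loc=15.0, scale=10.0, size=2000)  # planted rogue dimension

idx = rng.integers(0, 2000, size=(1000, 2))
C = pairwise_cos(emb, idx)                # C(S): full space
C_rogue = pairwise_cos(emb[:, 1:], idx)   # C'(S): rogue dimension removed
C_ctrl = pairwise_cos(emb[:, :-1], idx)   # control: a non-rogue dimension removed

r2_rogue = np.corrcoef(C, C_rogue)[0, 1] ** 2
r2_ctrl = np.corrcoef(C, C_ctrl)[0, 1] ** 2
print(r2_rogue, r2_ctrl)  # removing the rogue dimension scrambles similarities
```

Removing the planted dimension leaves almost no shared variance with the full-space similarities, while removing an ordinary dimension barely matters, which is exactly the asymmetry the r² measure is designed to expose.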

Experiment
For this experiment, we compute r² = Corr[C(S), C'(S)]² for all layers of all models, using the same set of token representations as in Section 3.1. We remove the top k ∈ {1, 3, 5} dimensions, where dimensions are ranked by CC(f_i), the cosine similarity contribution of dimension i in layer ℓ. We report results for the first layer and the final two layers; results for all layers can be found in Table 5 of the Appendix.

Results are summarized in Table 2. We find that in the static embedding models and the earlier layers of each contextual model, no single dimension or subset of dimensions drives the variability in cosine similarity. By contrast, in later layers, the variability of cosine similarity is driven by just 1-5 dimensions. In the extreme cases of XLNet-12 and BERT-11, when we remove just a single dimension from the embedding space, almost none of the variance in cosine similarity can be explained by cosine similarity in the d − 1 dimensional subspace (r² = 0.028 and 0.046, respectively). This means that the token pairs which are similar to one another in the full embedding space are drastically different from the pairs which are similar when just a handful of dimensions are removed.
While similarity measures should reflect properties of the entire embedding space, we have shown that this is not the case with cosine similarity in contextualized embedding spaces. Not only does a small subset of dimensions in later layers drive the cosine similarity of randomly sampled words toward 1.0, but this subset also drives the variability of the measure. This result effectively renders cosine similarity a measure over 1-5 rogue dimensions rather than the entire embedding space.

Rogue Dimensions and Model Behavior
In the previous sections, we showed that a small subset of rogue dimensions dominates similarity measures. In this section, we ask whether those same dimensions are likewise important to the behavior of the model.

Behavioral Influence of Individual Dimensions
We measure the influence of individual dimensions on model behavior through an ablation study in the style of Morcos et al. (2018). The idea of neuron ablation studies is to examine how the performance of a network changes when a neuron is clamped to a fixed value, typically zero. In our study, we measure how much the language modeling distribution changes when dimension i of layer ℓ is fixed to zero.

Setup
Let P_f(s) be the original language modeling distribution of model f for some input s sampled from corpus O, and let P_{f∖i}(s) be the distribution after dimension i of layer ℓ is ablated. We measure how the distribution changes after ablation using the KL divergence between the ablated model distribution and the unaltered reference distribution. We use KL divergence, rather than typical measures of importance in feature ablation such as accuracy or perplexity, because we are interested in how much the prediction distributions change rather than performance on some task. Our measure of the importance of dimension i in layer ℓ of model f is the mean KL divergence between the two distributions across our corpus:

$$I(i, \ell, f) = \frac{1}{n} \sum_{s \in S} D_{KL}\!\left(P_f(s) \,\|\, P_{f \setminus i}(s)\right)$$

where S is a set of n inputs to the model.
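A minimal sketch of this importance measure, using a toy linear language-modeling head rather than a real transformer (the projection W, the hidden states H, and the near-constant "rogue-like" dimension are all stand-ins we invented for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(3)
V, d, n = 50, 8, 200           # vocab size, hidden dim, number of inputs
W = rng.normal(size=(d, V))    # toy output projection (logits = h @ W)
H = rng.normal(size=(n, d))    # hidden states at some layer
H[:, 0] *= 0.01                # dim 0: near-constant, contributes little

def importance(i):
    """I(i): mean KL between reference and ablated prediction distributions."""
    total = 0.0
    for h in H:
        h_abl = h.copy()
        h_abl[i] = 0.0         # clamp dimension i to zero
        total += kl(softmax(h @ W), softmax(h_abl @ W))
    return total / n

print(importance(0), importance(1))  # dim 0 matters far less than dim 1
```

In a real model one would instead zero the dimension via a forward hook on layer ℓ and compare the resulting next-token distributions, but the KL bookkeeping is the same.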

Experiment
To measure the importance of each dimension to model behavior, we compute I(i, ℓ, f) for the last 4 layers of each model over 10k distributions. Since the autoregressive models (GPT-2, XLNet) give a language modeling distribution over all tokens in the input, we use a corpus of 10k tokens from English Wikipedia. For the auto-encoder models (BERT, RoBERTa), we mask 15% of tokens and use a corpus of 150k tokens, for a total of 10k language modeling distributions. We plot the relative behavioral influence of each dimension against its contribution to cosine similarity, measured by CC(f_i) (each is normalized to sum to 1). For example, three dimensions (yellow, red, light yellow) dominate cosine similarity in GPT-2, but when we trace those dimensions to the bottom half of the plot, they appear to vanish, meaning their relative influence on model behavior is negligible. While this mismatch is less pronounced for BERT, it is particularly extreme in XLNet, where a single dimension dominates cosine similarity but is effectively meaningless to the pretraining objective.

Results
The mismatch is quite severe in the final layers of XLNet and GPT-2, where removing the dimensions which dominate cosine similarity does not lead to substantial changes in the language modeling distribution. While ablating rogue dimensions often alters the language modeling distribution more than ablating non-rogue dimensions, we emphasize that there is not a one-to-one correspondence between a dimension's influence on cosine similarity and its influence on language modeling behavior. In the case of XLNet and GPT-2, removing dimensions which dominate cosine similarity leads to only vanishingly small changes in the behavior of the model.

Behavioral Correlates of Rogue Dimensions
We now turn to the related question of whether rogue dimensions actually capture linguistically meaningful information. Because rogue dimensions dominate representational similarity measures, these measures will be heavily biased toward whatever information these dimensions capture. To explore their behavioral correlates, we plotted the distribution of the values for rogue dimensions. We show in Figure 2 that rogue dimensions often have highly type/position-specific activation patterns. Rogue dimensions in all models are particularly sensitive to instances of the "." token and/or position 0 of the input. For example, in layers 2-11 of GPT-2 and RoBERTa, the mean cosine similarity of any two tokens in position 0 is greater than .99, while the mean similarity of tokens not in position 0 is .623 and .564, respectively.
While the transformer language models we have tested have all been shown to capture a rich range of linguistic phenomena, this linguistic knowledge may be obscured by rogue dimensions. The following section empirically evaluates this hypothesis.

Postprocessing and Representational Quality
While we have shown that the representational geometry of contextualized embeddings makes cosine similarity uninformative, there are several simple postprocessing methods which can correct for this. In this section we outline three such methods: standardization, all-but-the-top (Mu and Viswanath, 2018), and ranking (via Spearman correlation).
We evaluate representational quality of the postprocessed embeddings on several word similarity/relatedness datasets and show that the underlying representational quality is obscured by the rogue dimensions. When we correct for rogue dimensions, correlation with human similarity judgments improves across the board. We also find that representational quality is preserved across all layers, rather than giving way to degraded/task-specific representations as argued in previous work.

Figure 2: Each color corresponds to a specific type/position. The orange distribution is tokens which occur in position zero, the blue distribution is instances of the "." token, and green is instances of all other tokens. Results for all layers can be found in Figures 8 and 9 of the appendix.

Postprocessing
Standardization: We have observed that a small subset of dimensions with means far from zero and high variance completely dominate cosine similarity. A straightforward way to adjust for this is to subtract the mean vector and divide each dimension by its standard deviation, such that each dimension has μ_i = 0 and σ_i = 1. Concretely, given some corpus O of length |O| containing word representations, we compute the mean vector

$$\mu = \frac{1}{|O|} \sum_{x \in O} f(x)$$

as well as the standard deviation in each dimension

$$\sigma_i = \sqrt{\frac{1}{|O|} \sum_{x \in O} \left(f(x)_i - \mu_i\right)^2}$$

Our new standardized representation for each word vector becomes the z-score in each dimension:

$$z(x)_i = \frac{f(x)_i - \mu_i}{\sigma_i}$$

All-but-the-top: Following from similar observations (a nonzero common mean vector and a small number of dominant directions) in static embedding models, Mu and Viswanath (2018) proposed subtracting the common mean vector and eliminating the top few principal components (they suggested the top d/100), which should capture the variance of the rogue dimensions in the model and make the space more isotropic.
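Both transformations are a few lines of NumPy. This sketch runs on synthetic embeddings with one planted rogue dimension (the sizes and distributions are our own assumptions):

```python
import numpy as np

def standardize(E):
    """z-score each dimension over the corpus: mu_i = 0, sigma_i = 1."""
    return (E - E.mean(axis=0)) / E.std(axis=0)

def all_but_the_top(E, D):
    """Mu & Viswanath (2018): subtract the common mean vector, then
    remove the projection onto the top D principal components."""
    E = E - E.mean(axis=0)
    # principal directions via SVD of the centered matrix
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    top = Vt[:D]                      # (D, d) top principal directions
    return E - (E @ top.T) @ top

rng = np.random.default_rng(4)
E = rng.normal(size=(1000, 8))
E[:, 0] = rng.normal(loc=25.0, scale=12.0, size=1000)  # planted rogue dimension

Z = standardize(E)
P = all_but_the_top(E, D=1)
print(Z.std(axis=0))                  # every dimension now has unit variance
print(E[:, 0].std(), P[:, 0].std())   # rogue variance removed by top-1 PCA
```

Note that all-but-the-top removes whole directions (so the choice of D matters), whereas standardization rescales every dimension while keeping all of them.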
Spearman correlation: Zhelezniak et al. (2019) question the use of cosine similarity as a measure of similarity and propose the use of non-parametric rank correlation coefficients, such as Spearman's ρ, when embeddings depart from normality. Spearman correlation is simply Pearson correlation computed between the ranks of embedding values rather than the values themselves. Thus Spearman correlation can also be thought of as a postprocessing technique, where instead of standardizing the space or removing the top components, we transform embeddings as x' = rank(x). Spearman's ρ is robust to outliers and thus will not be dominated by the rogue dimensions of contextual language models. Unlike standardization and all-but-the-top, Spearman correlation requires no computations over the entire corpus. While rank-based similarity measures will not be dominated by rogue dimensions, however, rogue dimensions will tend to occupy the top or bottom ranks.
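The rank view can be verified directly with SciPy. Here we use toy vectors with one planted outlier coordinate (purely illustrative; the sizes and the outlier magnitude are our own choices):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rng = np.random.default_rng(5)
u = rng.normal(size=300)
v = u + rng.normal(scale=0.5, size=300)
u[0], v[0] = 1e4, -1e4   # one planted "rogue" outlier coordinate

# Pearson (and hence cosine, up to centering) is wrecked by the outlier;
# Spearman, being Pearson on ranks, is robust to it -- though note the
# outlier still occupies an extreme rank.
print(pearsonr(u, v)[0])   # dominated by the single outlier pair
print(spearmanr(u, v)[0])  # still reflects the other 299 coordinates
```

The equivalence with "Pearson on ranks" can be checked by correlating `rankdata(u)` with `rankdata(v)` directly.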

Representational Quality
While we have shown that cosine similarity is dominated by a small subset of dimensions, a remaining question is whether adjusting for these dimensions makes similarity measures more informative. In particular, we evaluate whether the cosine similarities between word pairs align more closely with human similarity judgments after postprocessing. We evaluate this using 4 word similarity/relatedness judgment datasets: RG65 (Rubenstein and Goodenough, 1965), WS353 (Agirre et al., 2009), SIMLEX999 (Hill et al., 2015) and SIMVERB3500 (Gerz et al., 2016). Examples in these datasets consist of a pair of words and a corresponding similarity rating averaged over several human annotators. Because the similarity judgments were designed to evaluate static embeddings, we use the context-aggregation strategy of Bommasani et al. (2020) to produce static representations. 12 For each model, we report the Spearman correlation between the model similarities and human-similarity judgments, averaged across all 4 datasets. 13 We report the correlation for cosine similarities of the original embeddings, as well as for postprocessed embeddings using four strategies: standardization, all-but-the-top (removing the top 7 components), only subtracting the mean (the step common to both strategies) and Spearman correlation.
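To make the evaluation pipeline concrete, here is a schematic version with synthetic stand-ins for the context-aggregated embeddings and human ratings. The "ratings" are generated from a latent similarity in the non-rogue dimensions, so the numbers illustrate the mechanics only, not any real benchmark result:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(6)
n_pairs, d = 200, 8

# Hypothetical stand-ins for context-aggregated word-pair embeddings; a real
# evaluation would use pairs from RG65 / WS353 / SimLex-999 / SimVerb-3500.
A = rng.normal(size=(n_pairs, d))
B = rng.normal(size=(n_pairs, d))
A[:, 0] = rng.normal(loc=40.0, scale=30.0, size=n_pairs)  # rogue dimension
B[:, 0] = rng.normal(loc=40.0, scale=30.0, size=n_pairs)

# Pretend human ratings track similarity in the non-rogue dimensions only.
human = np.sum(A[:, 1:] * B[:, 1:], axis=1)

def cos_rows(X, Y):
    return np.sum(X * Y, axis=1) / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))

def standardize(E):
    return (E - E.mean(axis=0)) / E.std(axis=0)

raw = spearmanr(cos_rows(A, B), human)[0]
std = spearmanr(cos_rows(standardize(A), standardize(B)), human)[0]
print(raw, std)  # standardization recovers the latent similarity signal
```

Under these assumptions, the rogue dimension drowns out the latent signal in the raw cosine similarities, and standardization restores it, which is the pattern the paper reports on the real datasets.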

Results
Results are summarized in Figure 3. Our key findings are: Postprocessing aligns the embedding space more closely to human similarity judgments across almost all layers of all models. We found that standardization was the most successful postprocessing method, showing consistent improvement over the original embeddings in all but the early layers of BERT.
All-but-the-top was generally effective, though the resulting final layer of RoBERTa and GPT-2 exhibited poor correlation with human judgments, similar to the original embeddings. In pilot analyses, we found that all-but-the-top is highly dependent on the number of components removed, a hyperparameter D, which Mu and Viswanath (2018) suggest should be d/100. Just removing the first principal component in RoBERTa yielded a stronger correlation, but all-but-the-top did not significantly improve correlation with human judgments in the final layer of GPT-2 for any choice of D.
Simply subtracting the mean vector also yielded substantial gains in most models, with the exception of the final layers of GPT-2 and XLNet. The rogue dimensions in the last layer of these two models have exceptionally high variance. While subtracting the mean made the space more isotropic as measured by cosine similarity, it did not reduce the variance of each dimension. Particularly in the final layer of GPT-2 and XLNet, we found that 1-3 dimensions drive the variability of cosine similarity, and this was still the case when the mean vector was subtracted.

12 We aggregate over between 200 and 500 single-sentence contexts of each word type using sentences from English Wikipedia. Words with an insufficient number of contexts were omitted, leaving a total of 1,894 unique words and 4,577 unique pairs. We use mean pooling over subwords to get a single representation for a word.
13 Full results for each dataset can be seen in Figures 10, 11, 12, and 13 of the Appendix.
Converting embeddings into ranks (Spearman correlation) also resulted in significantly stronger correlations with human judgments in all layers of all models, though the correlation was often weaker than standardization or all-but-the-top.
Representational quality is preserved across all layers. Previous work has suggested that the final layers of transformer language models are highly task-specific. Liu et al. (2019) showed that the middle layers of BERT outperform the final layers on language understanding tasks. Using a cosine-similarity-based text-generation evaluation metric, Zhang et al. (2020) likewise found that intermediate layers outperform the final layers. Our findings suggest that linguistic representational quality (in this case, lexical semantics) is actually preserved in the final layers but is obscured by a small handful of rogue dimensions. After simple postprocessing, later layers of the model correlate just as well as, if not better than, intermediate layers with human similarity judgments. This finding reaffirms the need to carefully consider the representational geometry of a model before drawing conclusions about layerwise representational quality and the general linguistic knowledge these models encode.

Discussion and Future Work
Perhaps the most important direction for future work is designing and implementing language models which do not develop rogue dimensions in the first place. Gao et al. (2019) introduced a cosine regularization term during pretraining which improved the performance of transformer models on machine translation. Perhaps BERT or GPT models could similarly benefit from such regularization.
A prerequisite for designing models without rogue dimensions is understanding how these dimensions arise over time. Contemporaneous work from Biś et al. (2021) provides a useful characterization of how degenerate representations may emerge during training; in the present work, we observe strong correlations with specific tokens and positions. Unifying these accounts is an important task for future work. With the recent release of the MultiBERTs checkpoints (Sellam et al., 2021), future work can uncover whether rogue dimensions are a coincidental property of some models, or whether they are a requisite for good performance. The MultiBERTs may also elucidate how these dimensions emerge during pretraining. While we empirically motivate a trivially simple transformation which corrects for rogue dimensions, we believe the most fruitful direction for future work is to build models whose representations require no post-hoc transformations. This would result in more interpretable embedding spaces and may additionally lead to models with better performance.

Conclusion
In this work, we showed that similarity measures in contextual language models are largely reflective of a small number of rogue dimensions, not the entire embedding space. Consequently, a few dimensions can drastically change the conclusions we draw about the linguistic phenomena a model actually captures. We showed that the previously observed anisotropy in contextual models is essentially an artifact of rogue dimensions and is not a global property of the entire embedding space. We also showed that variability in similarity is driven by just 1-5 dimensions of the embedding space. In many cases, removing just a single dimension completely changed which token pairs were similar to one another. However, we found that model behavior was not driven by these rogue dimensions, and that these dimensions seem to handle a small subset of a model's linguistic abilities, such as punctuation and positional information. In summary, standard similarity measures such as cosine similarity and Euclidean distance are not informative measures of how contextual language models represent and process language. We argue that measures of similarity in contextual language models must account for rogue dimensions using techniques such as standardization. These techniques should not just be viewed as avenues to improve downstream performance, but as prerequisites for any analysis involving representational similarity.

A Removing Dominant Dimensions and Representational Geometry
To facilitate a direct comparison with the anisotropy estimates of Ethayarajh (2019), we replicate the experiments of Section 4 before and after removing the top k dimensions with the largest E[CC_i].
For these experiments we chose k = 5 dimensions to remove. Results for anisotropy estimates are shown in Figure 4. Three key takeaways from this analysis are: All models tested had highly anisotropic representations, including XLNet and RoBERTa, which had not been evaluated in previous work. XLNet is even more anisotropic than GPT-2 in its final two layers. RoBERTa's word representations are likewise highly anisotropic, though starting in earlier layers than in XLNet and BERT.
After removing just 5 dimensions, embeddings become relatively isotropic, withÂ( f ) never larger than 0.25 in any layer of any model.
Anisotropy becomes consistent across models and across layers, suggesting that the deviant dimensions which drive anisotropy are idiosyncratic and model/layer-specific; we show this to indeed be the case in Section 4. By contrast, the geometry of the embedding space without rogue dimensions shows similar properties across models and layers, suggesting that the shared qualities of each model's representational geometry are obscured by these rogue dimensions.
This can additionally be seen in our replication of the intra-sentence similarity and self-similarity analyses from Ethayarajh (2019). While they find extreme cases in which words of the same type are no more similar to one another than randomly sampled words, we find a consistently high degree of self-similarity across all layers of all models after removing 5 dimensions. This suggests that information about word identity is preserved across all layers, rather than giving way to extremely contextualized representations in the final layer; this concurs with our findings in Section 5. Together, these results show that conclusions about the geometry of contextual embedding spaces are heavily skewed by the sensitivity of cosine similarity to the rogue dimensions present in each of these models.

B Informativity of Euclidean Distance
In this section, we conduct an analysis similar to that of Section 3.2 to see whether the variability in Euclidean distances between pairs of embeddings can be explained by Euclidean distance when the top k dimensions are removed. Our methods are identical to those of Section 3.2, except that our criterion for choosing k is the variance in each dimension. Results are shown in Table 3. In the extreme case of XLNet, none of the variability in Euclidean distances can be explained by Euclidean distances when a single dimension is removed. This means that Euclidean distance in this layer is effectively a measure of a single dimension.

Figure: Average self-similarity (similarity of the same word type across contexts) by layer of the full embedding space (left) and with the top 5 dimensions, as measured by E[CC_i], removed (right). In the full embedding space, words of the same type in GPT-2 and XLNet appear no more similar to one another than randomly sampled tokens. When we remove just 5 dimensions, words of the same type are indeed more similar to one another than the random baseline.

Figure 9: Distribution of activations in the dimension with highest variance in layers 7-12 of each model across a sample of 10k tokens. Each color corresponds to a specific type/position, where the orange distribution is tokens occurring in position zero, the blue distribution is instances of the "." token, and green is all other tokens. In many cases, there are two clear modes in each distribution, where one corresponds to a specific word type or position. Additionally, this behavior tends to persist within the same dimension number across layers, which is facilitated by the residual connections present in each model.

Table 5: Proportion of variance in cosine similarity, r², explained by cosine similarity when the top k dimensions (measured by cosine similarity contribution) are removed. Layer 0 is the static embedding layer.