“Average” Approximates “First Principal Component”? An Empirical Analysis on Representations from Neural Language Models

Contextualized representations from neural language models have furthered the state of the art in various NLP tasks. Despite their great success, the nature of such representations remains a mystery. In this paper, we present an empirical property of these representations: "average" approximates "first principal component". Specifically, experiments show that the average of these representations shares almost the same direction as the first principal component of the matrix whose columns are these representations. We believe this explains why the average representation is always a simple yet strong baseline. Our further examinations show that this property also holds in more challenging scenarios, for example, when the representations come from a model right after random initialization. Therefore, we conjecture that this property is intrinsic to the distribution of the representations and not necessarily related to the input structure. We observe that these representations empirically follow a normal distribution in each dimension, and by assuming this is true, we demonstrate that the empirical property can in fact be derived mathematically.


Introduction
A large variety of state-of-the-art methods for NLP tasks are nowadays built upon contextualized representations from pre-trained neural language models, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019). Despite this great success, we lack an understanding of the nature of such representations. For example, Aharoni and Goldberg (2020) have shown that averaging the BERT representations in a sentence can preserve its domain information. However, to the best of our knowledge, there is no analysis of what gives averaged representations their power.
* Jingbo Shang is the corresponding author.

[Figure 1: Visualization of our discovered empirical property: "average" ≈ "first principal component".]

In this work, we present an empirical property of these representations: "average" ≈ "first principal component". As shown in Figure 1, given a sequence of L tokens, one can construct a d × L matrix R using the d-dimensional representation r_i of the i-th token as its i-th column. There are two popular ways to project this matrix into a single d-dimensional vector: (1) the average and (2) the first principal component. Formally, the average r̄ is a d-dimensional vector with r̄ = (1/L) Σ_{i=1}^{L} r_i. The first principal component p is a d-dimensional vector whose direction maximizes the variance of the (mean-shifted) L representations. The property can then be written as |cos(r̄, p)| ≈ 1; this absolute value exceeds 0.999 in our experiments.
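As a minimal sketch of this comparison, the snippet below computes both projections and their cosine similarity. Since it does not load a real language model, synthetic representations drawn from per-dimension normal distributions (an assumption motivated by our later analysis; the means and scales are illustrative) stand in for model outputs, and the principal direction is obtained from the top eigenvector of the Gram matrix C = RᵀR, the construction used in our analysis section.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 768, 30  # representation dimension, sequence length

# Synthetic stand-in for model outputs: each dimension j gets its own
# mean mu_j and scale sigma_j (hypothetical values, not fitted to BERT).
mu = rng.normal(0.0, 1.0, size=d)
sigma = np.abs(rng.normal(0.0, 0.3, size=d))
R = mu[:, None] + sigma[:, None] * rng.standard_normal((d, L))  # d x L

r_bar = R.mean(axis=1)  # the average representation

# First principal direction via the top eigenvector w of C = R^T R,
# then p = R w.
C = R.T @ R
w = np.linalg.eigh(C)[1][:, -1]  # eigenvector of the largest eigenvalue
p = R @ w

cos = abs(r_bar @ p) / (np.linalg.norm(r_bar) * np.linalg.norm(p))
print(f"|cos(r_bar, p)| = {cos:.4f}")
```

On such synthetic data the absolute cosine similarity comes out very close to 1, mirroring the property.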
We examine the generality of this property and find that it also holds in three more scenarios, where each r_i is drawn from (1) a fixed layer (not necessarily the last one) of a pre-trained neural language model, (2) a fixed layer of a model right after random initialization, without any training, or (3) random token representations across all sentences encoded by a pre-trained model. Therefore, we conjecture that this property is intrinsic to the distribution of the representations, which is related to the neural language model's architecture and parameters, and not necessarily to the input structure. We observe that the empirical distribution of these representations resembles a normal distribution in each dimension. Assuming this holds, we show that the property can in fact be derived mathematically.
Our contributions are summarized as follows.
• We discover a common, insightful property of several pre-trained neural language models: "average" ≈ "first principal component". To some extent, this explains why the average representation is always a simple yet strong baseline.
• We verify the generality of this property by obtaining representations from a random mixture of layers and sentences, and also by using randomly initialized models instead of pre-trained ones.
• We show that representations from language models empirically follow a per-dimension normal distribution that leads to the property.
Reproducibility. We will release code to reproduce our experiments on GitHub.

Experimental Settings
Dataset. We randomly sample 4,000 sentences each from three datasets in three different domains: AG's news corpus (Zhang et al., 2015), KP20k Computer Science papers (Meng et al., 2017), and DBpedia (Zhang et al., 2015).
Pre-trained Neural Language Models. We experiment on four well-known language models.

The Property: "Average" ≈ "First Principal Component"

In most applications, each representation r_i in R comes from the tokens within the same sentence and the last layer of a pre-trained neural language model. Following this setting, we conduct 4,000 tests on each of the three datasets and summarize the results in Table 1. One can easily see that the average and minimum absolute cosine similarities are very close to 1 for all pre-trained neural language models. The word embeddings satisfy the property on average, but not for some outlier sentences. Given that uniformly random representations have near-zero average and minimum absolute cosine similarity values, we conclude that this is a special property of the representations generated by language models.
To some extent, this explains the effectiveness of the average last-layer representation of a language model, which has been widely adopted and observed in the literature.

Generality Tests of the Property
Different Layers. To evaluate the generality of our discovered property, we first investigate whether it only holds for last-layer representations. For the four transformer-based language models, there are 13 possible layers (i.e., one after the lookup table and 12 after the encoder/decoder layers) from which to retrieve token representations. We therefore test the property on representations from each layer and plot the average absolute cosine similarities in Figure 2. One can see that the property holds for the last few layers in all four models.

Randomly Initialized Models. We repeat the same test for randomly initialized models, i.e., models not (pre-)trained at all. The results are also in Figure 2. Again, the property holds for the last few layers in all four models.

Random Sentences. Finally, we explore the case where the representations come from different sentences. Specifically, we shuffle the last-layer token representations of all 4,000 sentences and regroup them into 4,000 random lists of representations. With high probability, each token representation in a list was generated independently of the other tokens in the same list. We show the results in the Random Sentences section of Table 2. Surprisingly, even with such "unrelated" token representations, the property still holds well.
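The random-sentence regrouping test can be sketched as follows. Again, synthetic per-dimension normal token representations stand in for model outputs (sizes and distribution parameters are illustrative, not the paper's actual setup):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128

def abs_cos_mean_pc(R):
    """|cos| between the column-average of R and p = R w, where w is the
    top eigenvector of the L x L Gram matrix C = R^T R."""
    C = R.T @ R
    w = np.linalg.eigh(C)[1][:, -1]  # eigenvector of the largest eigenvalue
    p = R @ w
    r_bar = R.mean(axis=1)
    return abs(r_bar @ p) / (np.linalg.norm(r_bar) * np.linalg.norm(p))

# Toy corpus: 100 "sentences" x 20 tokens of per-dimension normal vectors.
mu = rng.normal(0.0, 1.0, size=d)
sigma = np.full(d, 0.5)
tokens = mu[:, None] + sigma[:, None] * rng.standard_normal((d, 100 * 20))

# Shuffle all token representations and regroup into random "sentences".
perm = rng.permutation(tokens.shape[1])
groups = np.split(tokens[:, perm], 100, axis=1)
scores = [abs_cos_mean_pc(g) for g in groups]
print(f"min |cos| over random groups: {min(scores):.4f}")
```

Even for these randomly regrouped lists, the absolute cosine similarity stays close to 1 for every group.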

Analysis
In this section, we attempt to explain why language models exhibit this property. From the generality tests in the previous section, we know that the property also holds for randomly initialized models. Such models know nothing about natural language. Therefore, it is reasonable to believe that this property is intrinsic to the models and related to the distribution of these representations.

Representation Distribution Analysis: BERT as a Case Study
We show that each dimension of the BERT representations likely follows a normal distribution. From the Q-Q plot (Wilk and Gnanadesikan, 1968) in Figure 3, we can see that the quantiles of the first dimension match those of a normal distribution almost perfectly. We have checked another ten random dimensions and their quantiles all match well (see Appendix).
We also compare the skewness and kurtosis of a standard normal distribution with those of the empirical distribution of standardized representation values in each dimension. Consider the representation matrix R for all D = 224,970 representations over the 4,000 sentences, and let s_j denote its j-th row, i.e., the vector containing the values of dimension j across all representations. The standardized vector of s_j is s̃_j = (s_j − mean(s_j)) / std(s_j). For each dimension j, 1 ≤ j ≤ d, one can obtain an empirical distribution from s̃_j. From Table 3, the third moment matches a standard normal distribution well, while the fourth moment is a bit off. Further, we examine the off-diagonal terms of the d × d covariance matrix of the representations, which have a mean of 0.0101 and a standard deviation of 0.0116. Compared with the mean of 0.1747 of the diagonal terms, this is very small. Therefore, we conjecture that each dimension of BERT's representations can be treated approximately as an independent normal distribution. We note that we do not perform normality tests due to the large dataset size (over 200,000 representations): with so many samples, even a minor deviation from normality makes statistical tests reject the null hypothesis.
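The moment comparison can be sketched as below. A synthetic normal sample of the same size D stands in for one dimension of the BERT representations (the loc/scale values are hypothetical); for a standard normal distribution, the third standardized moment (skewness) is 0 and the fourth (kurtosis) is 3.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for one dimension s_j of the representations (the paper uses
# D = 224,970 BERT vectors; here we draw normal samples of that size).
s = rng.normal(loc=-0.3, scale=0.8, size=224_970)

# Standardize, then compute the third and fourth moments to compare
# against N(0, 1), for which skewness = 0 and kurtosis = 3.
s_std = (s - s.mean()) / s.std()
skewness = np.mean(s_std ** 3)
kurtosis = np.mean(s_std ** 4)
print(f"skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")
```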
In the rest of this section, we assume representations are sampled from d normal distributions, i.e., each dimension follows a distribution N (µ j , σ 2 j ).

Fitted Distributions Satisfy the Property
We verify the property on representations generated from this distribution. When the parameters µ_j, σ_j are estimated from language-model representations, the property holds (see Appendix).
We can also randomly sample the parameters from pre-defined distributions, as shown in Table 4. The results on pre-defined distributions tell us: (1) the average of all µ_j should be 0, (2) not all µ_j should be exactly 0, and (3) the variance should not be too large in magnitude compared to the mean. In the following analysis, we additionally restrict all representations to sum to zero, i.e., Σ_{j=1}^{d} r_{ij} = 0 for every representation r_i. This is mainly for simplicity of the covariance matrix computation, since the PCA algorithm will first mean-shift the R matrix.
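A small simulation consistent with these observations is sketched below (synthetic data with illustrative parameters; the zero-sum restriction is enforced by subtracting each representation's mean over its dimensions). It contrasts nonzero per-dimension means, where the property holds, with all-zero means, where it degrades.

```python
import numpy as np

rng = np.random.default_rng(3)
d, L = 256, 32

def abs_cos_mean_pc(R):
    """|cos| between the average column of R and p = R w for the top
    eigenvector w of C = R^T R."""
    C = R.T @ R
    w = np.linalg.eigh(C)[1][:, -1]
    p, r_bar = R @ w, R.mean(axis=1)
    return abs(r_bar @ p) / (np.linalg.norm(r_bar) * np.linalg.norm(p))

def sample(mu, sigma):
    R = mu[:, None] + sigma[:, None] * rng.standard_normal((d, L))
    return R - R.mean(axis=0, keepdims=True)  # enforce sum_j r_ij = 0

mu = rng.normal(0.0, 1.0, size=d)
mu -= mu.mean()  # condition (1): the average of all mu_j is 0

good = abs_cos_mean_pc(sample(mu, np.full(d, 0.5)))           # moderate variance
bad = abs_cos_mean_pc(sample(np.zeros(d), np.full(d, 0.5)))   # violates (2)
print(f"nonzero means: {good:.4f}   all-zero means: {bad:.4f}")
```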

Covariance Matrix C of Normally Distributed Representations
We define the L-by-L covariance matrix C = RᵀR. Its L-by-1 eigenvector w corresponding to the largest eigenvalue can be used to obtain the first principal component, i.e., p = Rw. We show that if the representations follow a per-dimension normal distribution, C takes a special shape: in expectation, all diagonal entries share one positive value and all off-diagonal entries share another. We theoretically derive the mean and standard deviation of the entries based on µ_j and σ_j (derivations are available in the Appendix), empirically estimate their values, and report them in Table 5. It is clear that the standard deviation is smaller than the mean in magnitude, confirming the special shape of C. Also, the theoretical and estimated values mostly match. The only significant difference is the standard deviation of the diagonal entries, which is due to the difference in the fourth-moment statistics between the representations and the standard normal distribution, as shown in Table 3.
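This special shape can be checked numerically. The sketch below (synthetic per-dimension normal representations, illustrative sizes) compares the empirical diagonal and off-diagonal entries of C = RᵀR against their expectations under the assumption r_ij ~ N(µ_j, σ_j²).

```python
import numpy as np

rng = np.random.default_rng(4)
d, L = 512, 64
mu = rng.normal(0.0, 1.0, size=d)
sigma = np.abs(rng.normal(0.0, 0.5, size=d))
R = mu[:, None] + sigma[:, None] * rng.standard_normal((d, L))

C = R.T @ R
diag = np.diag(C)
off = C[~np.eye(L, dtype=bool)]  # all off-diagonal entries

# Expectations under r_ij ~ N(mu_j, sigma_j^2), dimensions independent:
#   E[C_ii] = sum_j (mu_j^2 + sigma_j^2),   E[C_ik] = sum_j mu_j^2 (i != k)
E_diag = np.sum(mu ** 2 + sigma ** 2)
E_off = np.sum(mu ** 2)
print(f"diag mean {diag.mean():10.1f}  vs theory {E_diag:10.1f}")
print(f"off  mean {off.mean():10.1f}  vs theory {E_off:10.1f}")
```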

This Special C → the Property
If all diagonal entries of the covariance matrix C equal a > 0 and all off-diagonal entries equal b > 0, the eigenvector w corresponding to the largest eigenvalue is a uniform vector. The Perron–Frobenius theorem (Samelson, 1957) states that the (unique) largest eigenvalue λ of a positive matrix is bounded by its minimum and maximum row sums: min_i Σ_k C_ik ≤ λ ≤ max_i Σ_k C_ik. Due to its special shape, every row sum of C is around a + b(L − 1), so the largest eigenvalue λ_1 ≈ a + b(L − 1). To obtain w, one can solve Cw = λ_1 w. Clearly w = 1 is a solution, where 1 is the all-ones vector of length L. As a result, the first principal component p = Rw follows the same direction as the average.
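The idealized case can be verified exactly: for a matrix with diagonal a and off-diagonal b, the all-ones vector is an eigenvector with eigenvalue a + b(L − 1), and this is the largest eigenvalue (the remaining ones equal a − b). A quick check with illustrative values of a, b, L:

```python
import numpy as np

L, a, b = 50, 5.0, 2.0
C = b * np.ones((L, L)) + (a - b) * np.eye(L)  # diagonal a, off-diagonal b

vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
lam1, w = vals[-1], vecs[:, -1]  # largest eigenvalue and its eigenvector

print(f"largest eigenvalue: {lam1:.4f}  (a + b(L-1) = {a + b * (L - 1):.4f})")
print(f"top eigenvector is uniform: {np.allclose(w, w[0])}")
```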
Related Work

There are other attempts to analyze properties of language models. Clark et al. (2019) analyze the syntactic information that BERT's attention maps capture. K et al. (2020) study the causes of the multilinguality of multilingual BERT. Wang and Chen (2020) show that positional information is learned differently in different language models. Different from these language-specific properties, we believe our newly discovered property relates more to the internal structure of neural language models.

Conclusion and Future Work
This paper presents a common, insightful property of representations from neural language models: "average" ≈ "first principal component". This property is general and holds in many challenging scenarios. After analyzing the BERT representations as a case study, we conjecture that these representations follow a normal distribution in each dimension, and that this distribution leads to our discovered property. We believe that this work can shed light on future directions: (1) identifying the distributions that representations from language models follow, and (2)

A Q-Q Plot of Ten Random Dimensions
We randomly sample another 10 of the 768 dimensions of BERT and plot their quantiles against a normal distribution in Figure 4. All 10 dimensions match a normal distribution quite well.
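The quantile comparison underlying such a Q-Q plot can be sketched without any plotting library: compare the empirical quantiles of one dimension against the quantiles of a normal distribution fitted to it. Here a synthetic normal sample (illustrative loc/scale, size comparable to our D) stands in for a representation dimension.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
s = rng.normal(0.1, 0.6, size=200_000)  # stand-in for one representation dim

# Q-Q comparison: empirical quantiles vs. quantiles of a fitted normal.
qs = np.linspace(0.01, 0.99, 99)
empirical = np.quantile(s, qs)
fitted = NormalDist(mu=s.mean(), sigma=s.std()).inv_cdf
theoretical = np.array([fitted(q) for q in qs])

max_gap = np.max(np.abs(empirical - theoretical))
print(f"max quantile gap: {max_gap:.4f}")
```

A small maximum gap corresponds to the quantile points lying on the diagonal of the Q-Q plot.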

B Normal Distribution Estimated from Models
In addition to randomly sampled µ_j and σ_j, we can also use the empirical mean and standard deviation of each dimension of the representations from pre-trained language models. Table 6 shows that the property is well satisfied on representations generated this way. This further suggests that representations from these models have properties similar to normal distributions.

C Diagonal & Off-diagonal Values
Here we show the calculations for the values in the covariance matrix C. Note that C_ii = Σ_{j=1}^{d} r_ij² and C_ik = Σ_{j=1}^{d} r_ij r_kj for i ≠ k, so each diagonal entry C_ii is a sum of d products of normally distributed random variables with themselves, and all C_ii follow the same distribution; each off-diagonal entry C_ik is a sum of d products of pairs of independent normally distributed random variables, and similarly, all off-diagonal entries follow the same distribution. Therefore, in expectation, the covariance matrix has identical diagonal entries and identical off-diagonal entries. The mean and variance can be derived mathematically; under the per-dimension independence assumption, E[C_ii] = Σ_j (µ_j² + σ_j²), Var(C_ii) = Σ_j (4µ_j²σ_j² + 2σ_j⁴), E[C_ik] = Σ_j µ_j², and Var(C_ik) = Σ_j (2µ_j²σ_j² + σ_j⁴). We also outline the steps of the derivation. Following our notation, r_ij ∼ N(µ_j, σ_j²) implies r_ij = σ_j z_ij + µ_j, where z_ij ∼ N(0, 1) is a standard normal variable.
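These moments can be checked by Monte Carlo simulation. The sketch below draws many independent representation pairs from per-dimension normals (illustrative d, µ_j, σ_j) and compares the sample moments of C_ii and C_ik against the derived expressions E[C_ii] = Σ_j (µ_j² + σ_j²), Var(C_ii) = Σ_j (4µ_j²σ_j² + 2σ_j⁴), E[C_ik] = Σ_j µ_j², and Var(C_ik) = Σ_j (2µ_j²σ_j² + σ_j⁴).

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 64, 200_000
mu = rng.normal(0.0, 1.0, size=d)
sigma = np.abs(rng.normal(0.0, 0.5, size=d)) + 0.1

# n independent pairs (r_i, r_k) of representations, r_ij ~ N(mu_j, sigma_j^2)
r_i = mu + sigma * rng.standard_normal((n, d))
r_k = mu + sigma * rng.standard_normal((n, d))

C_ii = np.sum(r_i * r_i, axis=1)  # samples of a diagonal entry
C_ik = np.sum(r_i * r_k, axis=1)  # samples of an off-diagonal entry

# Derived moments under the per-dimension normal assumption:
E_diag = np.sum(mu**2 + sigma**2)
V_diag = np.sum(4 * mu**2 * sigma**2 + 2 * sigma**4)
E_off = np.sum(mu**2)
V_off = np.sum(2 * mu**2 * sigma**2 + sigma**4)

print(f"diag: mean {C_ii.mean():.2f} vs {E_diag:.2f}, var {C_ii.var():.2f} vs {V_diag:.2f}")
print(f"off:  mean {C_ik.mean():.2f} vs {E_off:.2f}, var {C_ik.var():.2f} vs {V_off:.2f}")
```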