Norm of Word Embedding Encodes Information Gain

Distributed representations of words encode lexical semantic information, but what type of information is encoded, and how? Focusing on the skip-gram with negative-sampling method, we found that the squared norm of a static word embedding encodes the information gain conveyed by the word; the information gain is defined by the Kullback-Leibler (KL) divergence of the co-occurrence distribution of the word to the unigram distribution. Our findings are explained by the theoretical framework of the exponential family of probability distributions and confirmed through precise experiments that remove spurious correlations arising from word frequency. This theory also extends to contextualized word embeddings in language models, or any neural network with a softmax output layer. We also demonstrate that both the KL divergence and the squared norm of embedding provide a useful metric of the informativeness of a word in tasks such as keyword extraction, proper-noun discrimination, and hypernym discrimination.


Introduction
The strong connection between natural language processing and deep learning began with word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017; Schnabel et al., 2015). Even in today's complex models, each word is initially converted into a vector in the first layer. One of the particularly interesting empirical findings about word embeddings is that the norm represents the relative importance of the word while the direction represents the meaning of the word (Schakel and Wilson, 2015; Khodak et al., 2018; Arefyev et al., 2018; Pagliardini et al., 2018; Yokoi et al., 2020).
This study focuses on the word embeddings obtained by the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013). We show theoretically and experimentally that the squared Euclidean norm of word embedding approximately encodes KL(w), the KL divergence of the co-occurrence distribution p(•|w) of the word w to the unigram distribution p(•). In Bayesian inference, the expected KL divergence is called information gain. In this context, the prior distribution is p(•), and the posterior distribution is p(•|w). The information gain represents how much information we obtain about the context word distribution when observing w. Table 1 shows that the 10 highest values of KL(w) are given by context-specific informative words, while the 10 lowest values are given by context-independent words. Fig. 1 shows that ∥u_w∥² is almost linearly related to KL(w); this relationship also holds for a larger corpus of Wikipedia dump, as shown in Appendix G. We prove in Section 4 that the square of the norm of the word embedding with a whitening-like transformation approximates the KL divergence.

[Figure 1: Linear relationship between the KL divergence and the squared norm of word embedding for the text8 corpus computed with 100 epochs. The color represents word frequency n_w. Plotted for all vocabulary words, but those with n_w < 10 were discarded. A regression line was fitted to words with n_w > 10^3. Other settings are explained in Section 4.2 and Appendix A.]
Empirically, the KL divergence, and thus the norm of word embedding, is helpful for some NLP tasks. In other words, the notion of information gain, which is defined in terms of statistics and information theory, can be used directly as a metric of informativeness in language. We show this through experiments on the tasks of keyword extraction, proper-noun discrimination, and hypernym discrimination in Section 7.
In addition, we perform controlled experiments that correct for word frequency bias to strengthen the claim. The KL divergence is heavily influenced by the word frequency n_w, the number of times that word w appears in the corpus. Since the corpus size is finite, although often very large, the KL divergence calculated from the co-occurrence matrix of the corpus is influenced by the quantization error and the sampling error, especially for low-frequency words. The same is also true for the norm of word embedding. This results in bias due to word frequency, and a spurious relationship is observed between word frequency and other quantities. Therefore, in the experiments, we correct the word frequency bias of the KL divergence and the norm of word embedding.
Measures other than the KL divergence are discussed in Appendix B; the KL divergence is more strongly related to the norm of word embedding than the Shannon entropy of the co-occurrence distribution (Fig. 7) or the self-information −log p(w) (Fig. 8).

The contributions of this paper are as follows:
• We showed theoretically and empirically that the squared norm of word embedding obtained by the SGNS model approximates the information gain of a word defined by the KL divergence. Furthermore, we extended this theory to encompass contextualized embeddings in language models.
• We empirically showed that the bias-corrected KL divergence and the norm of word embedding are similarly good metrics of word informativeness.
After providing related work (Section 2) and theoretical background (Section 3), we prove the main theoretical results in Section 4. In Section 5, we extend this theory to contextualized embeddings. We then explain the word frequency bias (Section 6) and evaluate KL(w) and ∥u_w∥² as metrics of word informativeness in the experiments of Section 7.
Related work

Norm of word embedding
Several studies empirically suggest that the norm of word embedding encodes the word informativeness. According to the additive compositionality of word vectors (Mitchell and Lapata, 2010), the norm of word embedding is considered to represent the importance of the word in a sentence, because longer vectors have a larger influence on the vector sum. Moreover, Yokoi et al. (2020) showed that word mover's distance performs well on the semantic textual similarity (STS) task when the word weights are set to the norm of word embedding while the transport costs are set to the cosine similarity. Schakel and Wilson (2015) claimed that the norm of word embedding together with the word frequency represents word significance, and showed experimentally that proper nouns have embeddings with larger norms than function words. It has also been shown experimentally that the norm of word embedding is smaller for less informative tokens (Arefyev et al., 2018; Kobayashi et al., 2020).

Metrics of word informativeness
Keyword extraction. Keywords are expected to carry relatively large amounts of information. Keyword extraction algorithms often use a metric of the "importance of words in a document" computed by methods such as TF-IDF or word co-occurrence (Wartena et al., 2010). Matsuo and Ishizuka (2004) showed that the χ² statistic computed from the word co-occurrence is useful for keyword extraction. The χ² statistic is closely related to the KL divergence (Agresti, 2013), since it approximates the likelihood-ratio chi-squared statistic G² = 2 n_w KL(w) when each document is treated as a corpus.
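The relation G² = 2 n_w KL(w) and its χ² approximation can be checked on toy counts; the following sketch (counts and probabilities are made up for illustration) computes both statistics:

```python
import numpy as np

def g2_and_chi2(counts, p_expected):
    """Likelihood-ratio statistic G^2 and Pearson chi-squared statistic for
    observed counts against expected probabilities."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    expected = n * np.asarray(p_expected, dtype=float)
    mask = counts > 0
    g2 = 2.0 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))
    chi2 = np.sum((counts - expected) ** 2 / expected)
    return g2, chi2

# Toy document: observed context counts for a word vs. the unigram distribution.
counts = [40, 30, 20, 10]
p = [0.35, 0.30, 0.20, 0.15]
g2, chi2 = g2_and_chi2(counts, p)

# G^2 equals 2 * n_w * KL(w) by construction.
n_w = sum(counts)
p_obs = np.array(counts) / n_w
kl = float(np.sum(p_obs * np.log(p_obs / np.array(p))))
```

For counts that are not too far from their expectations, χ² and G² are numerically close, which is the sense of the approximation above.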
Hypernym discrimination. The identification of hypernyms (superordinate words) and hyponyms (subordinate words) in word pairs, e.g., cat and munchkin, has been actively studied. Recent unsupervised hypernym discrimination methods are based on the idea that hyponyms are more informative than hypernyms and make discriminations by comparing a metric of the informativeness of words. Several metrics have been proposed, including the KL divergence of the co-occurrence distribution to the unigram distribution (Herbelot and Ganesalingam, 2013), the Shannon entropy (Shwartz et al., 2017), and the median entropy of context words (Santus et al., 2014).
Word frequency bias. Word frequency is a strong baseline metric for unsupervised hypernym discrimination. Discriminations based on several unsupervised methods with good task performance are highly correlated with those based simply on word frequency (Bott et al., 2021). The KL divergence achieved 80% precision but did not outperform word frequency (Herbelot and Ganesalingam, 2013). WeedsPrec (Weeds et al., 2004) and SLQS Row (Shwartz et al., 2017) correlate strongly with frequency-based predictions, calling for the need to examine the frequency bias in these methods.

Theoretical background
In this section, we describe the KL divergence (Section 3.2), the probability model of SGNS (Section 3.3), and the exponential family of distributions (Section 3.4), which form the background of our theoretical argument in the next section.

Preliminary
Probability distributions. We denote the probability of a word w in the corpus as p(w) and the unigram distribution of the corpus as p(•). Also, we denote the conditional probability of a word w′ co-occurring with w within a fixed-width window as p(w′|w), and the co-occurrence distribution as p(•|w). Since these are probability distributions, Σ_{w∈V} p(w) = Σ_{w′∈V} p(w′|w) = 1, where V is the vocabulary set of the corpus. The frequency-weighted average of p(•|w) is again the unigram distribution p(•); that is,

Σ_{w∈V} p(w) p(•|w) = p(•).    (1)

Embeddings. SGNS learns two different embeddings of dimension d for each word in V: the word embedding u_w ∈ R^d for w ∈ V and the context embedding v_{w′} ∈ R^d for w′ ∈ V. We denote the frequency-weighted averages of u_w and v_{w′} as

ū := Σ_{w∈V} p(w) u_w,   v̄ := Σ_{w′∈V} p(w′) v_{w′}.    (2)

We also use the centered vectors û_w := u_w − ū and v̂_{w′} := v_{w′} − v̄.

KL divergence measures information gain
Distributional semantics (Harris, 1954; Firth, 1957) suggests that "similar words will appear in similar contexts" (Brunila and LaViolette, 2022). This implies that the conditional probability distribution p(•|w) represents the meaning of a word w.
The difference between p(•|w) and the marginal distribution p(•) can therefore capture the additional information obtained by observing w in a corpus.
A metric for such discrepancies of information is the KL divergence of p(•|w) to p(•), defined as

KL(p(•|w) ∥ p(•)) = Σ_{w′∈V} p(w′|w) log ( p(w′|w) / p(w′) ).

In this paper, we denote it by KL(w) and call it the KL divergence of word w. Since p(•) is the prior distribution and p(•|w) is the posterior distribution given the word w, KL(w) can be interpreted as the information gain of word w (Oladyshkin and Nowak, 2019). Since the joint distribution of w′ and w is p(w′, w) = p(w′|w) p(w), the expected value of KL(w) is expressed as

Σ_{w∈V} p(w) KL(w) = Σ_{w∈V} Σ_{w′∈V} p(w′, w) log ( p(w′, w) / (p(w′) p(w)) ) = I(W′, W).

This is the mutual information I(W′, W) of the two random variables W′ and W that correspond to w′ and w, respectively. I(W′, W) is often called information gain in the literature.
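As an illustration, KL(w) can be computed directly from a row of the co-occurrence matrix and the unigram counts; a minimal sketch with made-up toy counts, using the usual convention 0 log 0 = 0:

```python
import numpy as np

def kl_of_word(cooc_row, unigram_counts):
    """KL(w) = KL( p(.|w) || p(.) ), with 0 log 0 treated as 0."""
    p_ctx = np.asarray(cooc_row, dtype=float)
    p_ctx /= p_ctx.sum()                       # p(.|w)
    p_uni = np.asarray(unigram_counts, dtype=float)
    p_uni /= p_uni.sum()                       # p(.)
    m = p_ctx > 0
    return float(np.sum(p_ctx[m] * np.log(p_ctx[m] / p_uni[m])))

unigram = [50, 30, 15, 5]
kl_flat = kl_of_word([50, 30, 15, 5], unigram)   # contexts match p(.): no gain
kl_peaked = kl_of_word([5, 5, 5, 85], unigram)   # context-specific word
```

A word whose contexts reproduce the unigram distribution has zero information gain, while a context-specific word has a large KL(w).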

The probability model of SGNS
The SGNS training utilizes Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012) to distinguish between p(•|w) and the negative sampling distribution q(•) ∝ p(•)^{3/4}. For each co-occurring word pair (w, w′) in the corpus, ν negative samples {w″_i}_{i=1}^{ν} are generated, and we aim to classify the ν + 1 samples {w′, w″_1, ..., w″_ν} as either a positive sample generated from w′ ∼ p(w′|w) or a negative sample generated from w″ ∼ q(w″). The objective of SGNS (Mikolov et al., 2013) involves computing the probability of w′ being a positive sample using a kind of logistic regression model, which is expressed as follows (Gutmann and Hyvärinen, 2012):

σ(⟨u_w, v_{w′}⟩) = p(w′|w) / ( p(w′|w) + ν q(w′) ),    (3)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function. To gain a better understanding of this formula, we can cross-multiply both sides of (3) by the denominators:

p(w′|w) + ν q(w′) = p(w′|w) (1 + e^{−⟨u_w, v_{w′}⟩})

and rearrange it to obtain:

p(w′|w) = ν q(w′) e^{⟨u_w, v_{w′}⟩}.    (4)

We assume that the co-occurrence distribution satisfies the probability model (4). This is achieved when the word embeddings {u_w} and {v_{w′}} perfectly optimize the SGNS objective, whereas it holds only approximately in reality.
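The equivalence between the calibrated classifier (3) and the probability model (4) can be verified numerically; a sketch with made-up values of ν, q(w′), and p(w′|w):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up values for the negative-sampling ratio nu, the noise probability
# q(w'), and the co-occurrence probability p(w'|w).
nu, q, p_pos = 5.0, 0.01, 0.08

# The score <u_w, v_w'> that satisfies the probability model (4),
# p(w'|w) = nu * q(w') * exp(<u_w, v_w'>):
score = np.log(p_pos / (nu * q))

# Plugging it into the classifier (3) recovers the calibrated probability
# of w' being a positive sample.
lhs = sigmoid(score)
rhs = p_pos / (p_pos + nu * q)
```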

Exponential family of distributions
We can generalize (4) by considering an instance of the exponential family of distributions (Lehmann and Casella, 1998; Barndorff-Nielsen, 2014; Efron, 2022), given by

p(w′|u) := q(w′) e^{⟨u, v_{w′}⟩ − ψ(u)},    (5)

where u ∈ R^d is referred to as the natural parameter vector, v_{w′} ∈ R^d represents the sufficient statistics (treated as constant vectors here, while tunable parameters in the SGNS model), and the normalizing function is defined as

ψ(u) := log Σ_{w′∈V} q(w′) e^{⟨u, v_{w′}⟩}.

The SGNS model (4) is interpreted as a special case of the exponential family for u = u_w with the constraint ψ(u_w) = −log ν for w ∈ V; the model (5) is a curved exponential family when the parameter value u is constrained as ψ(u) = −log ν, but we do not assume it in the following argument. This section outlines some well-known basic properties of the exponential family of distributions, which have been established in the literature (Barndorff-Nielsen, 2014; Efron, 1978, 2022; Amari, 1982). For ease of reference, we provide the derivations of these basic properties in Appendix J.
The expectation and the covariance matrix of v_{w′} with respect to w′ ∼ p(w′|u) are calculated as the first and second derivatives of ψ(u), respectively. Specifically, we have

η(u) := E_{w′∼p(w′|u)}[v_{w′}] = ∇ψ(u),    (6)
G(u) := Cov_{w′∼p(w′|u)}[v_{w′}] = ∇²ψ(u).    (7)

The KL divergence of p(•|u_1) to p(•|u_2) for two parameter values u_1, u_2 ∈ R^d is expressed as

KL(p(•|u_1) ∥ p(•|u_2)) = ψ(u_2) − ψ(u_1) − ⟨u_2 − u_1, η(u_1)⟩.    (8)

The KL divergence is interpreted as the squared distance between two parameter values when they are not very far from each other. In fact, the KL divergence (8) is expressed approximately as

KL(p(•|u_1) ∥ p(•|u_2)) ≈ (1/2) (u_1 − u_2)ᵀ G(u_i) (u_1 − u_2)    (9)

for i = 1, 2. Here, the equation holds approximately by ignoring higher-order terms of O(∥u_1 − u_2∥³). For more details, refer to Amari (1982, p. 369) and Efron (2022, p. 35). More generally, G(u) is the Fisher information metric, and (9) holds for a wide class of probability models (Amari, 1998).
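These properties, the gradient identity for the mean and the quadratic approximation of the KL divergence, can be checked numerically on a toy exponential family; the vocabulary size, dimension, and distributions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 4
v = rng.normal(size=(V, d))          # sufficient statistics v_w'
q = rng.dirichlet(np.ones(V))        # base distribution q(.)

def psi(u):
    return np.log(np.sum(q * np.exp(v @ u)))

def p_cond(u):
    w = q * np.exp(v @ u)
    return w / w.sum()

def kl(p1, p2):
    return float(np.sum(p1 * np.log(p1 / p2)))

u0 = 0.1 * rng.normal(size=d)
p0 = p_cond(u0)

# Gradient of psi equals the mean of v under p(.|u0); check by central differences.
eps = 1e-6
grad = np.array([(psi(u0 + eps * e) - psi(u0 - eps * e)) / (2 * eps)
                 for e in np.eye(d)])
mean_v = p0 @ v

# Quadratic approximation: KL(p(.|u0+delta) || p(.|u0)) ~ (1/2) delta^T G delta
# for a small displacement delta, with G the covariance of v under p(.|u0).
G = (v - mean_v).T @ (p0[:, None] * (v - mean_v))
delta = 0.02 * rng.normal(size=d)
kl_exact = kl(p_cond(u0 + delta), p0)
kl_quad = 0.5 * float(delta @ G @ delta)
```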

Squared norm of word embedding approximates KL divergence
In this section, we theoretically explain the linear relationship between KL(w) and ∥u_w∥² observed in Fig. 1 by elaborating on additional details of the exponential family of distributions (Section 4.1), and experimentally confirm our theoretical results (Section 4.2).

Derivation of theoretical results
We assume that the unigram distribution is represented by a parameter vector u_0 ∈ R^d; that is,

p(w′) = p(w′|u_0).    (10)

By substituting u_1 and u_2 with u_w and u_0, respectively, in (9), we obtain

KL(w) ≈ (1/2) (u_w − u_0)ᵀ G (u_w − u_0).    (11)

Here G := G(u_0) is the covariance matrix of v_{w′} with respect to w′ ∼ p(w′), which we can easily compute from (7), together with (2) and (6), as

G = Σ_{w′∈V} p(w′) v̂_{w′} v̂_{w′}ᵀ.

However, it is important to note that the value of u_0 is not trained in practice, and thus we need an estimate of u_0 to compute u_w − u_0 on the right-hand side of (11).
We argue that u_w − u_0 in (11) can be replaced by u_w − ū = û_w, so that

KL(w) ≈ (1/2) û_wᵀ G û_w.    (12)

For a formal derivation of (12), see Appendix K. Intuitively speaking, ū approximates u_0 because ū corresponds to p(•): ū is the weighted average of u_w, as seen in (2), while p(•) is the weighted average of p(•|u_w), as seen in (1).
To approximate u_0, we could also use the u_w of some representative words instead of ū. We expect u_0 to be very close to the u_w of stopwords such as 'a' and 'the', since their p(•|u_w) are expected to be very close to p(•).
Let us define a linear transform of the centered embedding as

ũ_w := G^{1/2} û_w / √2,    (13)

i.e., a whitening of û_w with the context embeddings; then (12) is now expressed as

KL(w) ≈ ∥ũ_w∥².    (14)

Therefore, the square of the norm of the word embedding with the whitening-like transformation in (13) approximates the KL divergence.

[Figure 3: Confirmation of (12) and (14). The slope coefficient of 1.384, which is close to 1, suggests the validity of the theory.]
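A sketch of the whitening-like transformation, assuming the form ũ_w = G^{1/2} û_w / √2 with the matrix square root taken via eigendecomposition; the toy covariance G and centered vector û_w below are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6

def sqrtm_psd(M):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ Q.T

A = rng.normal(size=(d, d))
G = A @ A.T / d                       # toy covariance of context embeddings
u_hat = rng.normal(size=d)            # a centered word embedding u_w - u_bar

u_tilde = sqrtm_psd(G) @ u_hat / np.sqrt(2)   # whitening-like transform

# ||u_tilde||^2 reproduces the quadratic form (1/2) u_hat^T G u_hat.
sq_norm = float(u_tilde @ u_tilde)
quad_form = 0.5 * float(u_hat @ G @ u_hat)
```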

Experimental confirmation of theory
The theory explained so far was confirmed by an experiment on real data.
Settings. We used the text8 corpus (Mahoney, 2011) with N = 17.0 × 10^6 tokens and |V| = 254 × 10^3 vocabulary words. We trained 300-dimensional word embeddings (u_w)_{w∈V} and (v_{w′})_{w′∈V} by optimizing the objective of the SGNS model (Mikolov et al., 2013). We also computed the KL divergence (KL(w))_{w∈V} from the co-occurrence matrix. These embeddings and KL divergences are used throughout the paper. See Appendix A for the details of the settings.
Details of Fig. 1. First, look again at the plot of KL(w) and ∥u_w∥² in Fig. 1. Although the u_w are raw word embeddings without the transformation (13), we confirm good linearity ∥u_w∥² ∝ KL(w). A regression line was fitted to words with n_w > 10^3, because low-frequency words are not very stable and were ignored. The coefficient of determination R² = 0.831 indicates a very good fit.
Adequacy of theoretical assumptions. In Fig. 1, the minimum value of KL(w) is observed to be very close to zero. This indicates that p(•|w) for the most frequent w is very close to p(•) in the corpus, and that the assumption (10) in Section 4.1 is adequate.
Confirmation of the theoretical results. To confirm the theory stated in (11), we estimated u_0 as the frequency-weighted average of the word vectors of {the, of, and}. These three words were selected because they are the top three words in word frequency n_w. The correctness of (11) was then verified in Fig. 2, where the slope coefficient is much closer to 1 than the 0.048 of Fig. 1.
Similarly, the fitting in Fig. 3 confirmed the theory stated in (12) and (14), where we replaced u_0 by ū.
Experiments on other embeddings. In Appendix G, the theory was verified by performing experiments using a larger corpus of Wikipedia dump (Wikimedia Foundation, 2021). In Appendix H, we also confirmed similar results using pre-trained fastText (Bojanowski et al., 2017) and SGNS (Li et al., 2017) embeddings.

Contextualized embeddings
The theory developed for static embeddings of the SGNS model extends to contextualized embeddings in language models, or any neural network with a softmax output layer.

Theory for language models
The final layer of language models computes, from a contextualized embedding u ∈ R^d, the logits y_{w′} = ⟨u, v_{w′}⟩ + b_{w′} with weights v_{w′} ∈ R^d and biases b_{w′} ∈ R, and the probability of choosing the word w′ ∈ V is calculated by the softmax function

p_softmax(w′|u) := e^{y_{w′}} / Σ_{w∈V} e^{y_w}.    (15)

Comparing (15) with (5), the final layer is actually interpreted as the exponential family of distributions with q(w′) = e^{b_{w′}} / Σ_{w∈V} e^{b_w}, so that p_softmax(w′|u) = p(w′|u). Thus, the theory for SGNS based on the exponential family of distributions should hold for language models. However, we need the following modifications to interpret the theory. Rather than representing the co-occurrence distribution, p(•|u) now signifies the word distribution at a specific token position provided with the contextualized embedding u. Instead of the frequency-weighted average ū = Σ_{w∈V} p(w) u_w, we redefine ū := Σ_{i=1}^{N} u_i / N as the average over the contextualized embeddings {u_i}_{i=1}^{N} calculated from the training corpus of the language model. Here, u_i denotes the contextualized embedding computed for the i-th token of the training set of size N. The information gain of a contextualized embedding u is

KL(u) := KL(p(•|u) ∥ p(•)).

With these modifications, all the arguments presented in Sections 3.4 and 4.1, along with their respective proofs, remain applicable in the same manner (Appendix L), and we have the main result (14) extended to contextualized embeddings as

KL(u) ≈ ∥ũ∥²,    (16)

where the contextualized version of the centering and whitening are expressed as û := u − ū and ũ := G^{1/2} û / √2, respectively. The experimental results, including those for BERT and GPT-2, as well as additional details, are described in Appendix I. While not as distinct as the result from SGNS in Fig. 1, it was observed that the theory suggested by (16) approximately holds true in the case of contextualized embeddings from language models.
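The identification of the softmax layer (15) with the exponential family (5) amounts to absorbing the biases into q(•); a numeric sketch with random toy weights:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 20, 8
v = rng.normal(size=(V, d))      # output-layer weight rows v_w'
b = rng.normal(size=V)           # output-layer biases b_w'
u = rng.normal(size=d)           # a contextualized embedding

# Softmax output (15).
y = v @ u + b
p_softmax = np.exp(y - y.max())
p_softmax /= p_softmax.sum()

# The same distribution written as the exponential family (5),
# with the biases absorbed into q(w') = e^{b_w'} / sum_w e^{b_w}.
q = np.exp(b - b.max())
q /= q.sum()
unnorm = q * np.exp(v @ u)
p_expfam = unnorm / unnorm.sum()
```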

Word frequency bias in KL divergence
The KL divergence is highly correlated with word frequency. In Fig. 5, 'raw' shows the plot of KL(w) against n_w. The KL divergence tends to be larger for less frequent words. A part of this tendency represents the true relationship that rarer words are more informative and thus tend to shift the co-occurrence distribution away from the corpus distribution. However, a large part of the tendency, particularly for low-frequency words, comes from the error caused by the finite size N of the corpus. This introduces a spurious relationship between KL(w) and n_w, i.e., a direct influence of word frequency. The word informativeness can be better measured by the KL divergence when this error is adequately corrected.

Estimation of word frequency bias
Preliminary.The word distributions p(•) and p(•|w) are calculated from a finite-length corpus.
The observed probability of a word w is p(w) = n_w/N, where N = Σ_{w∈V} n_w. The observed probability of a context word w′ co-occurring with w is p(w′|w) = n_{w,w′} / Σ_{w″∈V} n_{w,w″}, where (n_{w,w′})_{w,w′∈V} is the co-occurrence matrix. We computed n_{w,w′} as the number of times that w′ appears within a window of ±h around w in the corpus. Note that the denominator of p(w′|w) is Σ_{w″∈V} n_{w,w″} = 2h n_w if the endpoints of the corpus are ignored.
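The counting convention above can be sketched on a toy token sequence; note the endpoint effect on the row sums:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, h):
    """n_{w,w'}: the number of times w' appears within +/- h tokens of w."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - h), min(len(tokens), i + h + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

tokens = ["a", "b", "a", "c", "a", "b"]
counts = cooccurrence_counts(tokens, h=1)

# Away from the endpoints, sum_{w''} n_{w,w''} = 2 h n_w; here the first
# occurrence of "a" sits at the corpus start, so one context is lost.
row_sum_a = sum(c for (w, _), c in counts.items() if w == "a")
```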
Sampling error ('shuffle'). Now we explain how word frequency directly influences the KL divergence. Consider a randomly shuffled corpus, i.e., one whose words are randomly reordered from the original corpus (Montemurro and Zanette, 2010; Tanaka-Ishii, 2021). The unigram information, i.e., n_w and p(•), remains unchanged after shuffling the corpus. On the other hand, the bigram information, i.e., n_{w,w′} and p(•|w), computed for the shuffled corpus is independent of the co-occurrence of words in the original corpus. In the limit of N → ∞, p(•|w) = p(•) holds and KL(w) = 0 for all w ∈ V in the shuffled corpus. For finite corpus size N, however, p(•|w) deviates from p(•) because (n_{w,w′})_{w′∈V} is approximately interpreted as a sample from the multinomial distribution with parameters p(•) and 2h n_w.
In order to estimate the error caused by the direct influence of word frequency, we generated 10 sets of randomly shuffled corpora and computed the average of KL(w) over them, denoted as KL̄(w); it is shown as 'shuffle' in Fig. 5. KL̄(w) does not convey the bigram information of the original corpus but does represent the sampling error of the multinomial distribution. For sufficiently large N, we expect KL̄(w) ≈ 0 for all w ∈ V. However, KL̄(w) is very large for small n_w in Fig. 5.
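The shuffle baseline can be sketched as follows on a toy corpus: KL(w) is computed for the original token sequence, averaged over shuffled copies (where only the finite-size sampling error survives), and the difference gives the bias-corrected value of Section 6.2. The tiny corpus and vocabulary here are illustrative:

```python
import random
import numpy as np

def kl_per_word(tokens, vocab, h):
    """KL(w) for every vocabulary word, computed from co-occurrence counts."""
    idx = {w: i for i, w in enumerate(vocab)}
    n = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - h), min(len(tokens), i + h + 1)):
            if j != i:
                n[idx[w], idx[tokens[j]]] += 1
    p_uni = np.bincount([idx[w] for w in tokens], minlength=len(vocab))
    p_uni = p_uni / p_uni.sum()
    kl = np.zeros(len(vocab))
    for w in range(len(vocab)):
        p_ctx = n[w] / n[w].sum()
        m = p_ctx > 0
        kl[w] = np.sum(p_ctx[m] * np.log(p_ctx[m] / p_uni[m]))
    return kl

random.seed(0)
vocab = ["a", "b", "c"]
tokens = list("aabacbacabcbabca" * 20)   # toy corpus with bigram structure
kl_orig = kl_per_word(tokens, vocab, h=2)

# Average KL over shuffled copies: shuffling destroys all bigram information,
# so only the finite-size sampling error survives.
shuffled = []
for _ in range(10):
    t = tokens.copy()
    random.shuffle(t)
    shuffled.append(kl_per_word(t, vocab, h=2))
kl_bar = np.mean(shuffled, axis=0)

delta_kl = kl_orig - kl_bar              # bias-corrected KL divergence
```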
Sampling error ('lower 3 percentile'). Another computation of the sampling-error baseline, faster than 'shuffle', was also attempted; it is indicated as 'lower 3 percentile' in Fig. 5. This baseline is the lower 3-percentile point of KL(w) within a narrow bin of word frequency n_w. First, 200 bins were equally spaced on a logarithmic scale in the interval from 1 to max(n_w). Next, each bin was checked in order of decreasing n_w and merged so that each bin had at least 50 data points. This method allows for a faster and more robust computation of the baseline directly from the KL(w) values of the original corpus, without the need for shuffling.
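A sketch of the percentile-based baseline, with smaller bin counts than the paper's 200 bins and 50 points so it runs on toy data; the merging direction follows the description above, while the tie-breaking details are assumptions:

```python
import numpy as np

def lower_percentile_baseline(n_w, kl, n_bins=20, min_points=5, q=3):
    """Per-word sampling-error estimate: the lower q-percentile of KL within
    a log-spaced frequency bin, merging sparse bins from high to low n_w."""
    edges = np.geomspace(1.0, float(n_w.max()) + 1.0, n_bins + 1)
    bin_id = np.clip(np.digitize(n_w, edges) - 1, 0, n_bins - 1)
    groups, current = [], []
    for b in range(n_bins - 1, -1, -1):      # decreasing frequency
        current.extend(np.flatnonzero(bin_id == b))
        if len(current) >= min_points:
            groups.append(current)
            current = []
    if current:                               # leftover low-frequency words
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    baseline = np.empty_like(kl, dtype=float)
    for g in groups:
        idx = np.array(g)
        baseline[idx] = np.percentile(kl[idx], q)
    return baseline

rng = np.random.default_rng(0)
n_w = rng.integers(1, 1000, size=200)
kl = 1.0 / n_w + 0.5 * rng.random(200)   # error floor + word-specific signal
baseline = lower_percentile_baseline(n_w, kl)
```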

Quantization error ('round').
There is another word frequency bias due to the fact that the co-occurrence matrix only takes integer values; it is indicated as 'round' in Fig. 5. This quantization error is included in the sampling error estimated by the shuffle baseline, so no further correction is needed. See Appendix C for details.

Correcting word frequency bias
We simply subtracted the estimated sampling error from KL(w). Denoting the estimate, obtained by either 'shuffle' or 'lower 3 percentile', as KL̄(w), we call

∆KL(w) := KL(w) − KL̄(w)    (17)

the bias-corrected KL divergence. The same idea of random word shuffling has been applied to an entropy-like word statistic in an existing study (Montemurro and Zanette, 2010).

Experiments
In the experiments, we first confirmed that the KL divergence is indeed a good metric of the word informativeness (Section 7.1). Then we confirmed that the norm of word embedding encodes the word informativeness as well as the KL divergence does (Section 7.2). Details of the experiments are given in Appendices D, E, and F. As one of the baseline methods, we used the Shannon entropy of p(•|w), defined as

H(w) := −Σ_{w′∈V} p(w′|w) log p(w′|w).

It also represents the information conveyed by w, as explained in Appendix B.

KL divergence represents the word informativeness
Through keyword extraction tasks, we confirmed that the KL divergence is indeed a good metric of the word informativeness.
Settings. We used 15 public datasets of keyword extraction for English documents. Treating each document as a "corpus", vocabulary words were ordered by a measure of informativeness, and the Mean Reciprocal Rank (MRR) was computed as an evaluation metric. When a keyword consists of two or more words, the worst rank among its words was used. We used the metrics 'random', n_w, n_w H(w), and n_w KL(w) as our baselines. These metrics are computed only from each document, without relying on external knowledge such as a dictionary of stopwords or a set of other documents. For this reason, we did not use other metrics, such as TF-IDF, as our baselines. Note that ∥u_w∥² was not included in this experiment because embeddings cannot be trained from a very short "corpus".
Results and discussions. Table 2 shows that n_w KL(w) performed best on many datasets. Keywords therefore tend to have a large value of n_w KL(w), i.e., their p(•|w) differs significantly from p(•). This result supports the idea that keywords have significantly large information gain.

Norm of word embedding encodes the word informativeness
We confirmed through proper-noun discrimination tasks (Section 7.2.1) and hypernym discrimination tasks (Section 7.2.2) that the norm of word embedding, as well as the KL divergence, encodes the word informativeness, and also confirmed that correcting the word frequency bias improves it.
In these experiments, we examined the properties of the raw word embedding u_w instead of the whitening-like transformed embedding ũ_w. We used u_w from a practical standpoint, but experiments using ũ_w exhibited a similar trend.
Correcting word frequency bias. In the same way as (17), we correct the bias of the embedding norm and denote the bias-corrected squared norm as ∆∥u_w∥² := ∥u_w∥² − B(n_w), where the baseline term B(n_w) is the estimated word-frequency bias of the squared norm. We used the 'lower 3 percentile' method of Section 6.1 to estimate B(n_w), because recomputing embeddings for the shuffled corpus is prohibitive. Other bias-corrected quantities, such as ∆KL(w) and ∆H(w), were computed from 10 sets of randomly shuffled corpora.

Proper-noun discrimination
Settings. We used 10561 proper nouns, 123 function words, 4771 verbs, and 2695 adjectives that appeared in the text8 corpus at least 10 times. We used n_w, H(w), KL(w), and ∥u_w∥² as measures for discrimination. The performance of binary classification was evaluated by ROC-AUC.

[Table 3: Binary classification of part-of-speech. Values are the ROC-AUC (higher is better). See Fig. 9 in Appendix E for histograms of measures.]
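ROC-AUC can be computed without external libraries via the Mann-Whitney statistic, i.e., the probability that a randomly drawn positive scores above a randomly drawn negative; a minimal sketch with made-up scores:

```python
def roc_auc(pos_scores, neg_scores):
    """ROC-AUC as the probability that a random positive outscores a random
    negative (Mann-Whitney statistic; ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up measure values: proper nouns (positives) vs. function words.
auc = roc_auc([2.0, 1.5, 1.2], [0.4, 1.3, 0.2])
```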
Results and discussions. Table 3 shows that ∆KL(w) and ∆∥u_w∥² discriminate proper nouns from other parts of speech more effectively than the alternative measures. A larger value of ∆KL(w) or ∆∥u_w∥² indicates that a word appears in a more limited context. Fig. 6 illustrates that proper nouns tend to have larger values of ∆KL(w) and ∆∥u_w∥² than verbs and function words.
Hypernym discrimination

Results and discussions. Table 4 shows that ∆∥u_w∥² and ∆KL(w) were the best and the second best, respectively, at predicting the hypernym in hypernym-hyponym pairs. Correcting the frequency bias remedies the difficulty of discrimination for pairs with n_hyper < n_hypo, resulting in an improvement in the average accuracy.

Conclusion
We showed theoretically and empirically that the KL divergence, i.e., the information gain of the word, is encoded in the norm of word embedding.
The KL divergence, and thus the norm of word embedding, has a word frequency bias, which we corrected in the experiments. We then confirmed that the KL divergence and the norm of word embedding work as metrics of informativeness in NLP tasks.

Limitations
• An important limitation of the paper is that the theory assumes the skip-gram with negative sampling (SGNS) model for static word embeddings, or the softmax function in the final layer of language models for contextualized word embeddings.
• The theory also assumes that the model is trained perfectly, as mentioned in Section 3.3. When this assumption is violated, the theory may not hold. For example, the training is not perfect when the number of epochs is insufficient, as illustrated in Appendix G.

A Settings for computation of word embeddings and KL divergence

Corpus. We used the text8 corpus (Mahoney, 2011), an English corpus with N = 17.0 × 10^6 tokens and |V| = 254 × 10^3 vocabulary words. We used all the tokens separated by spaces for the word embeddings and the KL divergence.
Training of the SGNS model. Word embeddings were trained by optimizing the same objective function used in Mikolov et al. (2013). Parameters used to train SGNS are summarized in Table 5. The learning rate shown is the initial value, which we decreased linearly to the minimum value of 1.0 × 10^{-4} during the learning process. The negative sampling distribution was specified as q(•) ∝ p(•)^{3/4}. The elements of u_w were initialized by the uniform distribution over [−0.5, 0.5] divided by the dimensionality of the embedding, and the elements of v_w were initialized to zero.

(Notes: We manually checked that the words used in Table 1 and Table 8 were not personally identifiable or offensive. We used an AMD EPYC 7702 64-Core Processor (64 cores × 2); in this setting, the CPU time is estimated at about 12 hours.)

Computation of KL divergence. The value of KL(w) was computed from p(•|w) and p(•) using the definition in Section 3.2, with the convention that 0 log 0 = 0. The word probability p(w′) and the co-occurrence probability p(w′|w) were computed from the word frequency n_w and the co-occurrence matrix (n_{w,w′})_{w,w′∈V}, respectively, as described in Section 6. The co-occurrence matrix was computed with the window size h = 10.
Word set for visualization. We used 47 × 10^3 words with n_w ≥ 10 for the plots in Figs. 1 to 5. Except for Fig. 5, extreme points, up to 0.5% on each axis, were truncated to set the plot range. Word embeddings and the KL divergence are not very stable for low-frequency words. For this reason, we used 1820 words with n_w > 10^3 to fit the simple linear regression model using the least squares method.

B Other quantities of information theory
In addition to KL divergence, two other information theoretic quantities are discussed here.

B.1 Shannon entropy
The Shannon entropy of p(•|w), defined as

H(w) := −Σ_{w′∈V} p(w′|w) log p(w′|w),

also represents the information conveyed by w. In this paper, we call it the Shannon entropy of word w. H(w) is closely related to KL(w). The Shannon entropy of p(•|w) can be written as

H(w) = log |V| − KL(p(•|w) ∥ unif(•)),

meaning that −H(w) measures, up to the constant log |V|, how much the co-occurrence distribution shifts from the uniform distribution unif(w′) = 1/|V|. Thus, H(w) and KL(w) have different reference distributions.
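The identity H(w) = log|V| − KL(p(•|w) ∥ unif(•)) can be verified numerically; a sketch with a random toy distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 30
p_ctx = rng.dirichlet(np.ones(V))          # a co-occurrence distribution p(.|w)

H = float(-np.sum(p_ctx * np.log(p_ctx)))  # Shannon entropy H(w)
unif = np.full(V, 1.0 / V)
kl_to_unif = float(np.sum(p_ctx * np.log(p_ctx / unif)))

# H(w) = log|V| - KL( p(.|w) || unif(.) ): entropy measures the shift from
# the uniform distribution, whereas KL(w) measures the shift from p(.).
```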

B.2 Self-information
A more naive way of measuring the information of a word is the self-information of the event that the word w is sampled from p(•), defined as

I(w) := −log p(w).

The expected value Σ_{w∈V} p(w) I(w) is the Shannon entropy of p(•). Since p(w) was computed as p(w) = n_w/N, I(w) = log N − log n_w in effect looks at the word frequency n_w on the log scale.

B.3 Relation to word embedding
H(w) and I(w) were computed with the same settings as in Section 4.2 and Appendix A. They were plotted against ∥u_w∥² as shown in Fig. 7 and Fig. 8, respectively. Compared with KL(w), the relationships are less clear, with R² ≈ 0.4. From this experiment, we see that KL(w) represents ∥u_w∥² better than H(w) and I(w) do.

C Quantization error
The co-occurrence matrix (n_{w,w′})_{w,w′∈V} is sparse, with many zero values in the rows of w with small n_w. The effect of the quantization error caused by n_{w,w′} taking only integer values cannot be ignored for low-frequency words. This effect is part of the sampling error, but we try to isolate the quantization error here. Let us redefine n_{w,w′} := round(2h n_w p(w′)) and compute the KL divergence, denoted as KL_0(w), which is shown as 'round' in Fig. 5. If there were no rounding errors, p(w′|w) = p(w′), so that KL_0(w) = 0. In reality, however, KL_0(w) is non-negligible for words with small n_w, and this effect can be corrected by KL(w) − KL_0(w).
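A sketch of the 'round' baseline: expected counts 2h n_w p(w′) are rounded to integers and the resulting KL_0(w) is computed (the toy unigram distribution is made up):

```python
import numpy as np

def kl0_round(n_w, p_uni, h):
    """KL_0(w): KL divergence of rounded expected counts round(2 h n_w p(w'))
    to p(.), isolating the quantization error."""
    counts = np.round(2 * h * n_w * p_uni)
    p_ctx = counts / counts.sum()
    m = p_ctx > 0
    return float(np.sum(p_ctx[m] * np.log(p_ctx[m] / p_uni[m])))

p_uni = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])    # skewed toy unigram
kl0_rare = kl0_round(n_w=3, p_uni=p_uni, h=1)          # low-frequency word
kl0_frequent = kl0_round(n_w=10000, p_uni=p_uni, h=1)  # high-frequency word
```

For the frequent word, the expected counts are integers and KL_0 vanishes, while for the rare word the rounding alone produces a non-negligible KL_0.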

D Details of experiment in Section 7.1
In this experiment, we confirmed that human-annotated keywords of documents appear at the top of the ranking calculated from the discrepancy between p(•|w) and p(•).
Datasets. For the experiment on keyword extraction, we used 15 datasets in English. Each entry consists of a pair of a document and its gold keywords. Table 6 includes information on the size (the number of documents) and the type of documents.
Preparation. Each document in the datasets was tokenized by NLTK's word_tokenize function. Then, each word was stemmed using NLTK's PorterStemmer, and all characters were converted to lowercase. The same preprocessing of stemming and lowercasing was also applied to the gold keywords. However, we did not remove stopwords in preprocessing, in order to see whether the informativeness measures could remove unnecessary stopwords by themselves. The co-occurrence matrix for each document was computed with the window size h = 10. Note that only a subset V′ ⊂ V of the vocabulary set, described below, was used for stable computation of p(w′|w), w′ ∈ V′, w ∈ V. For constructing V′, all the words w ∈ V were sorted in decreasing order of n_w, and the cumulative frequency c_i = ∑_{j=1}^{i} n_{w_j} up to the i-th most frequent word was computed for i = 1, 2, ..., |V|. Then V′ = {w_1, ..., w_i} was defined with the smallest i such that c_i ≥ N/3.

Methods. In each document, word ranking lists were created by sorting its vocabulary words using the informativeness measures. For 'random', the ranking list is simply a random shuffle of the vocabulary words. For n_w H(w), words were ranked in increasing order; for the other measures, in decreasing order. We multiply KL(w) by n_w because G² = 2 n_w KL(w) is the appropriate statistic for testing the null hypothesis that p(•|w) = p(•). Similarly, n_w H(w) is interpreted as a test statistic for testing the null hypothesis that p(•|w) = unif(•).
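The construction of V′ can be sketched as follows (a simplified illustration, not the paper's implementation):

```python
def truncate_vocab(freqs, fraction=1/3):
    """Return V': the smallest prefix of the frequency-ranked vocabulary
    whose cumulative frequency c_i reaches `fraction` of the corpus size N.
    freqs: dict mapping word -> n_w."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    N = sum(freqs.values())
    cum, selected = 0, []
    for w in ranked:
        selected.append(w)
        cum += freqs[w]
        if cum >= N * fraction:
            break
    return selected
```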
We also included the χ² statistic (Matsuo and Ishizuka, 2004), which is related to KL(w) through the approximation χ² ≈ G² for sufficiently large n_w.
Evaluation metrics. We used MRR and P@5 as evaluation metrics for the keyword prediction task. MRR is the average of the reciprocals of the gold keywords' ranks; the numbers in the tables were multiplied by 100. For each document, we used the best-ranked keyword, i.e., the minimum of the ranks of the correct answers. If a keyword is a phrase consisting of two or more words, the rank of the keyword is defined by its worst-ranked word. For example, the rank of "New York" is 10 if the ranks of "new" and "york" are 3 and 10, respectively. P@5 is the average percentage of correct answers that appear in the top five words of the ranked list. For each document, the number of gold keywords in the top five words was computed and divided by 5. A keyword consisting of two or more words is regarded as a correct answer only when all of its words are included in the top five words. Thus the percentage can exceed 100 if several gold keywords share the same words.
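These two per-document metrics, including the phrase rules above, can be sketched as follows (our own illustrative implementation, not the paper's code):

```python
def phrase_rank(ranking, keyword):
    """Rank of a (possibly multi-word) keyword: the worst rank among its
    words; None if any of its words is missing from the ranking list."""
    pos = {w: i + 1 for i, w in enumerate(ranking)}
    ranks = [pos.get(w) for w in keyword.split()]
    return None if None in ranks else max(ranks)

def mrr_and_p5(ranking, gold_keywords):
    """Per-document MRR (reciprocal of the best-ranked gold keyword) and
    P@5 (number of gold keywords fully contained in the top five words,
    divided by 5)."""
    ranks = [r for r in (phrase_rank(ranking, k) for k in gold_keywords)
             if r is not None]
    mrr = 1.0 / min(ranks) if ranks else 0.0
    top5 = set(ranking[:5])
    hits = sum(all(w in top5 for w in k.split()) for k in gold_keywords)
    return mrr, hits / 5
```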
Results. Table 6 shows the MRR, and Table 7 shows the P@5 of the experiment. Datasets were sorted in increasing order of the MRR of the random baseline in both tables. Table 2 in Section 7.1 is a summary of Table 6. Small values of MRR or P@5 for the random baseline indicate the difficulty of keyword extraction. Datasets of the article type are difficult, and the dataset of the news type is the easiest. On the difficult datasets, n_w KL(w) performed best in almost all cases.

E Details of experiment in Section 7.2.1
In this experiment, we confirmed that proper nouns tend to have larger values of ∆KL(w) and ∆∥u_w∥² compared to other parts of speech.
Datasets. We used 10561 proper nouns, 123 function words, 4771 verbs, and 2695 adjectives that appeared at least 10 times (n_w ≥ 10) in the text8 corpus. The parts of speech of these words were identified by NLTK's POS tagger. Proper nouns are tagged as {NN, NNS}, verbs as {VB, VBD, VBG, VBN, VBP, VBZ}, adjectives as {JJ, JJS, JJR}, and function words as {IN, PRP, PRP$, WP, WP$, DT, PDT, WDT, CC, MD, RP}. Proper nouns were restricted to those found among the 61711 words of the English Proper Nouns database.

Preparation. We computed n_w, KL(w), and ∥u_w∥² from the text8 corpus as described in Appendix A. H(w) was computed in the same way as KL(w). For the bias-corrected versions, we used the 'shuffle' method of Section 6.1 for ∆KL(w) and ∆H(w), and the 'lower 3 percentile' method for ∆∥u_w∥². We used these measures for the binary classification of part of speech.
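A per-frequency-bin lower-percentile baseline of the kind used for ∆∥u_w∥² can be sketched as follows. This is our own illustration: the log-spaced binning granularity below is an assumption, not a detail taken from the paper.

```python
import numpy as np

def lower_percentile_correction(values, freqs, q=3, bins_per_decade=8):
    """Subtract a per-frequency-bin baseline (the q-th percentile of
    `values` within each log-frequency bin) from each value.
    values: e.g. squared norms; freqs: word frequencies n_w."""
    log_f = np.log10(freqs)
    edges = np.arange(log_f.min(), log_f.max() + 1e-9, 1 / bins_per_decade)
    idx = np.clip(np.digitize(log_f, edges) - 1, 0, None)
    corrected = np.empty_like(values, dtype=float)
    for b in np.unique(idx):
        m = idx == b
        corrected[m] = values[m] - np.percentile(values[m], q)
    return corrected
```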
Methods. As seen in Fig. 9, proper nouns tend to have large values of n_w, KL(w), and ∥u_w∥², or small values of H(w). Therefore, each word is classified as a proper noun if a measure is larger (or smaller, for H(w)) than a threshold value. We performed two sets of binary classification experiments: proper nouns vs. verbs, and proper nouns vs. adjectives.
Evaluation metrics. Since the classification depends on the threshold value, we used ROC-AUC to evaluate the classification performance. ROC-AUC was computed with scikit-learn's roc_curve function.
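Since thresholding a score is equivalent to ranking by it, ROC-AUC can equivalently be computed as the Mann-Whitney U statistic: the probability that a randomly chosen positive example outranks a randomly chosen negative one, with ties counted as one half. A dependency-free sketch (our own, shown only to make the metric concrete):

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """ROC-AUC of a threshold classifier, e.g. proper nouns (positives)
    vs. verbs (negatives) scored by KL(w): the probability that a random
    positive outscores a random negative, counting ties as 1/2."""
    pos = np.asarray(scores_pos, float)
    neg = np.asarray(scores_neg, float)
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

For a measure where proper nouns have smaller values, such as H(w), the scores can simply be negated before computing the AUC.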
Results. Table 3 in Section 7.2.1 shows the ROC-AUC of the classification task, confirming the good performance of ∆KL(w) and ∆∥u_w∥².
Table 8 shows randomly sampled proper nouns with 10 ≤ n_w ≤ 10³ and specific ranges of ∆KL(w); since our experiment is case-insensitive, some selected words were actually treated as common nouns, such as storm and haven. We observed that common nouns tend to have small KL values. On the other hand, words with large KL values include context-specific nouns, such as company names, suggesting that they are more informative.

F Details of experiment in Section 7.2.2
In this experiment, we confirmed that ∆KL(w) and ∆∥u_w∥² tend to take smaller values for the hypernym in hypernym-hyponym pairs.

Datasets. Among the hypernym-hyponym pairs in each dataset, we used those consisting of words that appear in the text8 corpus. Specifically, we used 1336 of the 1337 pairs of the BLESS dataset (Baroni and Lenci, 2011), 3635 of the 3637 pairs of the EVALution dataset (Santus et al., 2015), 1760 of the 1933 pairs of the Lenci/Benotto dataset (Lenci and Benotto, 2012), and all 1427 pairs of the Weeds dataset (Weeds et al., 2014). Each dataset was divided into two parts: the n_hyper > n_hypo part and the n_hyper < n_hypo part.
Methods. We considered the binary classification of identifying the hypernym in a given hypernym-hyponym pair. Using KL(w), ∥u_w∥², ∆KL(w), or ∆∥u_w∥² as a measure of informativeness, the word with the smaller value of the measure was predicted as the hypernym. Baseline methods for predicting the hypernym given a word pair (w_1, w_2) are described below.
• Random is the random classification.The accuracy is 50%.
• Word Frequency chooses the word with larger n w as hypernym.
• WeedsPrec (Weeds and Weir, 2003; Weeds et al., 2004) is based on the distributional inclusion hypothesis that the context of a hyponym is included in the context of its hypernym. The weighted inclusion of word w_2 in the context of word w_1 is formulated as WeedsPrec(w_1, w_2) = (∑_{w′ ∈ C_{w_1} ∩ C_{w_2}} p(w′|w_1)) / (∑_{w′ ∈ C_{w_1}} p(w′|w_1)), and w_1 is predicted as the hypernym if WeedsPrec(w_1, w_2) < WeedsPrec(w_2, w_1).
• SLQS (Santus et al., 2014) compares the median entropy of context words, defined as E(w) := median_{c ∈ C_w} H(c), where H(c) is the Shannon entropy of context word c. w_1 is predicted as the hypernym if 1 − E(w_2)/E(w_1) > 0, or equivalently E(w_1) > E(w_2). Note that C_w is the set of most strongly associated context words of w, as determined by positive local mutual information (Evert, 2005). We used |C_w| = 50.
• ∆SLQS is the bias-corrected version of SLQS: w_1 is predicted as the hypernym if ∆E(w_1) > ∆E(w_2), where ∆E(w) is the bias-corrected version of E(w), computed analogously to ∆H(w).

Evaluation metrics. The classification accuracy of each method was computed separately for the n_hyper > n_hypo part and for the n_hyper < n_hypo part of each dataset. Then, we calculated the unweighted average of the accuracy over the four datasets for each part and for both parts combined.
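The direction-classification rule (the word with the smaller measure value is predicted as the hypernym) and its accuracy can be sketched as follows; the measure values below are hypothetical, for illustration only.

```python
def hypernym_direction_accuracy(pairs, measure):
    """Accuracy of predicting the hypernym in (hypernym, hyponym) pairs
    by the rule: the word with the SMALLER measure value is the hypernym.
    pairs: list of (hypernym, hyponym); measure: dict word -> value.
    Ties are counted as errors."""
    correct = sum(measure[hyper] < measure[hypo] for hyper, hypo in pairs)
    return correct / len(pairs)
```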
Results. Table 9 shows the classification accuracy.

G Results on Wikipedia dump
We used the Wikipedia dump (Wikimedia Foundation, 2021) with N = 24.0 × 10⁸ tokens and |V| = 645 × 10⁴ vocabulary words, preprocessed by Wikiextractor (Attardi, 2015). The training of the SGNS model and the computation of the KL divergence were performed as in Appendix A, using the same settings. For plotting the results, we used 50,000 words randomly sampled from the 1,114,207 vocabulary words with n_w ≥ 10. For fitting the regression line, we used 2,662 words with n_w > 10³. Fig. 10 shows the word embeddings of the Wikipedia dump computed with the same settings as for the text8 corpus. The left panel of Fig. 10 is very similar to Fig. 1, confirming that the result for the text8 corpus is reproduced on the Wikipedia dump. The right panel of Fig. 10 corresponds to Fig. 8 with the axes exchanged and the log₁₀ n_w axis rescaled. Again, the two plots are very similar.
However, the result changes when the number of training epochs is reduced so that the optimization is insufficient. Fig. 11 shows the word embeddings of the Wikipedia dump with the number of epochs reduced to 10. In the left panel, the linear relationship is not reproduced. In the right panel, the norm of the embedding shrinks for low-frequency words with n_w < 100; plots of the same shape are also found in the literature (Schakel and Wilson, 2015; Arefyev et al., 2018; Pagliardini et al., 2018; Khodak et al., 2018). We consider this a consequence of insufficient optimization: the norm of the parameters tends to be smaller due to implicit regularization (Arora et al., 2019), so the trained parameters do not satisfy the ideal SGNS model (4) very well, particularly for low-frequency words.

H Results on pre-trained word embeddings
In this section, we show that the linear relationship between the KL divergence and the squared norm of word embedding holds also for pre-trained word embeddings.

H.1 Pre-trained fastText embeddings
We used the Wiki word vectors provided by Bojanowski et al. (2017). These 300-dimensional embeddings were trained for 5 epochs on Wikipedia with the fastText model. We used the same KL divergence as in Appendix G, calculated on the Wikipedia dump corpus. Results are shown in the left panel of Fig. 12, where we randomly selected 10,000 words that appeared at least 10⁴ times in the Wikipedia dump.

H.2 Pre-trained SGNS embeddings
We used the pre-trained SGNS vectors provided by Li et al. (2017). These 500-dimensional embeddings were trained for 2 epochs on Wikipedia with the SGNS model. We used the same KL divergence as in Appendix G, calculated on the Wikipedia dump corpus. Results are shown in the right panel of Fig. 12, where we randomly selected 10,000 words that appeared at least 10⁴ times in the Wikipedia dump.

I Results on contextualized embeddings
Settings. For the experiment on contextualized word embeddings, we used embeddings obtained from the final layer of BERT, RoBERTa, GPT-2, and Llama 2. We took 2000 sentences from the One Billion Word Benchmark (Chelba et al., 2014) and fed them into each language model to obtain contextualized embeddings of all tokens. Special tokens at the beginning and end of the tokenized inputs, if any, were excluded.
Results. Looking at the scatterplots in Fig. 13, approximate linear relationships can be observed for BERT, RoBERTa, and Llama 2, while for GPT-2 the linear relationship is somewhat weaker. According to the values in Table 10, whitening improves the linear relationship for GPT-2 and Llama 2 but worsens it for BERT and RoBERTa, so the effect of whitening is not clear-cut. While there is still room for discussion, overall, an approximate linear relationship between the KL divergence and the squared norm of contextualized embeddings appears to hold.

J Basic properties of the exponential family of distributions
The expectation and covariance matrix. The first and second derivatives of ψ(u) are computed as ∇ψ(u) = E_{q_u}[v] and ∇²ψ(u) = Cov_{q_u}[v] = G(u), where the expectation and covariance are taken under the distribution q_u specified by u, showing (6) and (7), respectively.
KL divergence. For computing the KL divergence, first note that log(p(w′|w)/p(w′)) = ⟨u_w, v_{w′}⟩ − ψ(u_w) from (4). Thus, the KL divergence is KL(w) = ∑_{w′∈V} p(w′|w) (⟨u_w, v_{w′}⟩ − ψ(u_w)) = ⟨u_w, ∇ψ(u_w)⟩ − ψ(u_w), showing (8).
Approximation of KL divergence. Next, we consider the Taylor expansion of ψ(u) at u = u_1. Ignoring higher-order terms of O(∥u − u_1∥³), we have ψ(u) ≈ ψ(u_1) + ⟨∇ψ(u_1), u − u_1⟩ + (1/2)⟨u − u_1, ∇²ψ(u_1)(u − u_1)⟩. Using (6) and (7), we can rewrite this expression for u = u_2 as ψ(u_2) ≈ ψ(u_1) + ⟨E_{q_{u_1}}[v], u_2 − u_1⟩ + (1/2)⟨u_2 − u_1, G(u_1)(u_2 − u_1)⟩, and substituting it into the expression for the KL divergence, we obtain KL(q_{u_1} ∥ q_{u_2}) ≈ (1/2)⟨u_2 − u_1, G(u_1)(u_2 − u_1)⟩, showing (9) for i = 1. Considering the Taylor expansion of G(u) at u = u_1, we have G(u_2) = G(u_1) + O(∥u_2 − u_1∥), so replacing G(u_1) with G(u_2) on the right-hand side changes it only by terms of O(∥u_2 − u_1∥³). Therefore, we have shown that (9) holds for both i = 1 and i = 2.
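The quadratic approximation can be checked numerically on a small synthetic exponential family of the form q_u(w′) = p(w′) exp(⟨u, v_{w′}⟩ − ψ(u)); the vocabulary size, dimension, and parameter scales below are arbitrary choices of ours.

```python
import numpy as np

# Synthetic exponential family q_u(w') = p(w') * exp(<u, v_w'> - psi(u)).
rng = np.random.default_rng(0)
V, d = 50, 10
p = rng.dirichlet(np.ones(V))                 # base (unigram-like) distribution
Vmat = rng.normal(size=(V, d)) / np.sqrt(d)   # context vectors v_w'

def psi(u):
    """Log-partition function: psi(u) = log sum_w' p(w') exp(<u, v_w'>)."""
    return np.log(p @ np.exp(Vmat @ u))

def q(u):
    """Distribution q_u(w') = p(w') exp(<u, v_w'> - psi(u))."""
    return p * np.exp(Vmat @ u - psi(u))

def kl(u1, u2):
    """Exact KL(q_{u1} || q_{u2})."""
    q1, q2 = q(u1), q(u2)
    return float((q1 * np.log(q1 / q2)).sum())

def fisher(u):
    """G(u) = Cov_{q_u}[v], the Hessian of psi at u."""
    qu = q(u)
    centered = Vmat - qu @ Vmat               # v_w' minus its mean under q_u
    return (centered * qu[:, None]).T @ centered

u1 = 0.05 * rng.normal(size=d)
u2 = u1 + 0.01 * rng.normal(size=d)           # nearby parameter
delta = u2 - u1
exact = kl(u1, u2)
approx = 0.5 * delta @ fisher(u1) @ delta
# exact and approx should agree up to O(||delta||^3)
```

A numerical gradient of psi also matches E_{q_u}[v], which is the content of (6).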
K High-dimensional random vectors
Random vector setting. In this section, we adopt a probabilistic viewpoint and treat the elements of the vectors u and v as random variables, denoted by u_i and v_i for i = 1, ..., d, in order to estimate the orders of magnitude of various quantities, such as vector norms. Although the embedding vectors {u_w}_{w∈V}, {v_{w′}}_{w′∈V} are not random variables, the random-variable setting is justified when we randomly sample words w and w′ from a large corpus and set u = u_w and v = v_{w′}. To simplify the analysis, we assume that the vector elements are distributed independently. While we could relax this assumption by imposing the spherical condition (Jung and Marron, 2009; Aoshima et al., 2018), we leave this extension for future work. We aim to discuss the relative magnitudes of vectors, so rescaling the vectors does not affect the argument. Therefore, we assume that each element is proportional to d^{−1/2}, i.e., u_i = O_p(d^{−1/2}) and v_i = O_p(d^{−1/2}). The squared norm of u is ∥u∥² = ∑_{i=1}^{d} u_i² = O_p(d · (d^{−1/2})²) = O_p(1), and the norm itself is ∥u∥ = (∥u∥²)^{1/2} = O_p(1).
Here O_p(1) means that the magnitude remains bounded even as the dimension d increases. The same applies to v, i.e., ∥v∥ = O_p(1). The inner product of u and v is likewise ⟨u, v⟩ = ∑_{i=1}^{d} u_i v_i = O_p(d · (d^{−1/2})²) = O_p(1). Throughout this section, we consider magnitudes up to O(d^{−1}) and ignore higher-order terms of O(d^{−3/2}) for sufficiently large d.
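These order-of-magnitude claims can be illustrated numerically (an illustrative sketch of ours, not part of the paper's experiments):

```python
import numpy as np

def scaling_demo(d, seed=0):
    """Sample u, v with i.i.d. Gaussian entries scaled by d^{-1/2}, so that
    u_i = O_p(d^{-1/2}), and return (||u||^2, <u, v>).  Both quantities
    stay O_p(1), i.e., bounded, as the dimension d grows."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=d) / np.sqrt(d)
    v = rng.normal(size=d) / np.sqrt(d)
    return float(u @ u), float(u @ v)
```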
ū approximates u_0. Regarding v, we used only the property v_i = O_p(d^{−1/2}) when deriving (21), so the result does not change if we replace v by v − v̄: ⟨u − ū, v − v̄⟩ = O_p(d^{−1/2}). However, the result changes if we further replace ū by u_0, meaning that ū approximates u_0. To show this, we first prepare another representation of (5) as follows.
Proof of (12). First note the following identity. Using (22), the magnitude of the remaining terms is obtained as follows.
norm of embedding for word w, denoted as ∥u w ∥, is closely related to the Kullback-Leibler (KL) divergence of the co-occurrence distribution p(•|w) of a word w for a fixed-width window to the unigram distribution p(•) of the corpus, denoted as KL(w) := KL(p(•|w) ∥ p(•)).

Figure 2 :
Figure 2: Confirmation of (11). The slope coefficient of 0.909, which is close to 1, indicates the validity of the theory.

Figure 4 :
Figure 4: Linear relationship between the KL divergence and the squared norm of contextualized embedding for RoBERTa and Llama 2. The color represents token frequency.

Figure 5 :
Figure 5: KL divergence computed with four different procedures, plotted against word frequency n_w for the same words as in Fig. 1. 'raw' and 'round' are KL(w) and KL₀(w), respectively; 'shuffle' is the KL divergence computed from the randomly shuffled corpus (Section 6.1). 'lower 3 percentile' is the lower 3-percentile point of KL(w) in each word frequency bin.

Figure 6 :
Figure 6: The bias-corrected KL divergence ∆KL(w) and the bias-corrected squared norm of word embedding ∆∥u_w∥² are plotted against word frequency n_w. Each dot represents a word: 10561 proper nouns (red), 123 function words (blue), and 4771 verbs (green). The same plot for adjectives, omitted from the figure, almost overlapped with that of the verbs.

Figure 7 :
Figure 7: The Shannon entropy and the squared norm of word embedding. Settings are the same as in Fig. 1.

Figure 8 :
Figure 8: The self-information and the squared norm of word embedding. Settings are the same as in Fig. 1.

Figure 9 :
Figure 9: Histogram of each measure used for binary classification of part of speech. Plotted for 10561 proper nouns (red) and 4771 verbs (green) in the text8 corpus.

Figure 12 :
Figure 12: Two pre-trained word embeddings. Each regression line was fitted to all the points in the scatterplot.

Figure 13 :
Figure 13: Linear relationship between the KL divergence and the squared norm of contextualized embedding for BERT, RoBERTa, GPT-2, and Llama 2. The color represents token frequency.

Table 1 :
Top 10 words and bottom 10 words sorted by the value of KL(w) in the text8 corpus, for words with frequency n_w ≥ 10.

Table 4 :
Accuracy of hypernym-hyponym classification; the unweighted average over the four datasets. See Table 9 in Appendix F for the complete result.

Table 6 :
MRR of keyword extraction experiment.

Table 7 :
P@5 of keyword extraction experiment. Columns: document type, random, n_w, n_w H(w), χ²(w), and n_w KL(w).

Table 8 :
Randomly sampled proper nouns for each range of informativeness measured by the KL divergence.

Table 9 :
Accuracy of hypernym classification. For each method, ∆Method is the bias-corrected version. We divided each dataset into two parts based on the word frequencies of the hypernym (n_hyper) and the hyponym (n_hypo). Dataset EVAL denotes EVALution.
Table 4 in Section 7.2.2 is a summary of Table 9. In terms of overall accuracy, ∆∥u_w∥² and ∆KL(w) were the best and second best, respectively, at predicting the hypernym in hypernym-hyponym pairs.