Estimating the Entropy of Linguistic Distributions

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. We end this paper with a concrete recommendation for the entropy estimators that should be used in future linguistic studies.


Introduction
There is a natural connection between information theory, the mathematical study of communication systems, and linguistics, the study of human language, the primary vehicle that humans employ to communicate. Researchers have exploited this connection since information theory's inception (Shannon, 1951; Cherry et al., 1953; Harris, 1991). With the advent of modern computing, the number of information-theoretic linguistic studies has risen, exploring claims about language such as the optimality of the lexicon (Piantadosi et al., 2011; Pimentel et al., 2021), the complexity of morphological systems (Cotterell et al., 2019; Wu et al., 2019; Rathi et al., 2021), and the correlation between surprisal and language processing time (Smith and Levy, 2013; Bentz et al., 2017; Goodkind and Bicknell, 2018; Cotterell et al., 2018; Meister et al., 2021, inter alia). In information-theoretic linguistics, a fundamental quantity of research interest is entropy. Entropy is both useful to linguists in its own right and necessary for estimating other useful quantities, e.g., mutual information. However, the estimation of entropy from raw data can be quite challenging (Paninski, 2003; Nowozin, 2015); e.g., in expectation, the plug-in estimator underestimates entropy (Miller, 1955). Linguistic distributions often present additional challenges. For instance, many linguistic distributions, such as the unigram distribution, follow a power law (Zipf, 1935; Mitzenmacher, 2004). Linguistics is not the only field with such nuances, and so a large number of entropy estimators have been proposed in other fields (Chao and Shen, 2003; Archer et al., 2014, inter alia). However, no work to date has attempted a practical comparison of these estimators on natural language data. This work fills this empirical void.
Our paper offers a large empirical comparison of the performance of 6 different entropy estimators on both synthetic and natural language data, an example of which is shown in Figure 1. We find that Chao and Shen's (2003) estimator is the best when very few data are available, but Nemenman et al.'s (2002) is superior as more data become available. Both are significantly better (in terms of mean-squared error) than the naïve plug-in estimator. Importantly, we also show that two recent studies (Williams et al., 2021; McCarthy et al., 2020) show smaller effect sizes when a better estimator is employed; however, we are able to reproduce a significant effect in both replications. We recommend that future studies carefully consider their choice of entropy estimators, taking into account data availability and the nature of the underlying distribution.

Entropy and Language
Shannon entropy is a quantification of the uncertainty in a random variable. Given a (discrete) random variable X with probability distribution p over K possible outcomes $\mathcal{X} = \{x_k\}_{k=1}^{K}$, the Shannon entropy of X is defined as
$$H(X) = -\sum_{k=1}^{K} p(x_k) \log p(x_k) \tag{1}$$
Since this quantity depends on X only through p, we also write it as H(p). Entropy has many uses throughout science and engineering; for instance, Shannon (1948) originally proposed entropy as a lower bound on the compressibility of a stochastic source.
Yet the application of information-theoretic techniques to linguistics is not so straightforward: Information-theoretic measures are defined over probability distributions and, in the study of natural language, we typically only have access to samples from the distribution of interest, e.g., the phonotactic distribution in English, which permits words we cannot find in a corpus, like blick, rather than the true probabilities required in the computation of Eq. (1). Indeed, it is often the case that not all elements of $\mathcal{X}$ are even observed in available data, such as words that were coined after a corpus was collected.
Rather, p must be approximated in order to estimate H(p). One solution is plug-in estimation: Given samples from p, the maximum-likelihood estimate for p is "plugged" into Eq. (1). However, as originally noted by Miller (1955), this strategy generally yields poor estimates. It is thus necessary to derive more nuanced estimators.

Statistical Estimation Theory
Statistical estimation theory provides us with the tools for estimating various quantities of interest based on samples from a distribution.
Central to this theory is the estimator: A statistic that approximates a property of the distribution our data is drawn from. More formally, let $D = \{x^{(n)}\}_{n=1}^{N}$ be samples from an unknown distribution p. Suppose we are interested in a quantity θ that can be computed as a function of the distribution p. An estimator $\hat{\theta}(D)$ for θ is then a function of the data D that provides an approximation of θ.
Two properties of an estimator are often of interest: bias, the difference between the expected value of our estimator $\hat{\theta}(D)$ under p and the true value of θ, and variance, how much $\hat{\theta}(D)$ fluctuates from sample set to sample set:
$$\mathrm{bias}(\hat{\theta}(D)) = \mathbb{E}_p[\hat{\theta}(D)] - \theta \tag{2}$$
$$\mathrm{var}(\hat{\theta}(D)) = \mathbb{E}_p\big[(\hat{\theta}(D) - \mathbb{E}_p[\hat{\theta}(D)])^2\big] \tag{3}$$
It is desirable to construct an estimator that has both low bias and low variance. However, the bias-variance trade-off tells us that we often cannot minimize both at once and should instead seek a balance between the two. This trade-off is evinced through mean-squared error (MSE), a metric oft-employed for assessing estimator quality:
$$\mathrm{MSE}(\hat{\theta}(D)) = \mathrm{bias}(\hat{\theta}(D))^2 + \mathrm{var}(\hat{\theta}(D)) \tag{4}$$
To recognize the trade-off, note that, for any fixed MSE, a decrease in bias must be compensated with an increase in variance and vice versa. Indeed, it is important to recognize that there is typically no single estimator that is seen as "best." Different estimators balance the bias-variance trade-off differently, making their perceived quality specific to one's use-case. Importantly, the effectiveness of an estimator also depends on the domain of interest. Consequently, an empirical study of various entropy estimators, which this paper provides, is necessary in order to determine which entropy estimators are best suited for linguistic distributions.
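To make the decomposition of MSE into squared bias and variance concrete, the following Python sketch estimates all three quantities for the plug-in entropy estimator by simulation. The uniform distribution over 20 outcomes, the sample size of 30, and the helper names `plugin_entropy` and `bias_var_mse` are illustrative assumptions, not settings taken from our experiments.

```python
import math
import random
from collections import Counter


def plugin_entropy(sample):
    """Plug-in (maximum-likelihood) entropy estimate in nats."""
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())


def bias_var_mse(estimator, probs, n, trials=2000, seed=0):
    """Monte Carlo estimates of the bias, variance, and MSE of `estimator`."""
    rng = random.Random(seed)
    outcomes = range(len(probs))
    truth = -sum(p * math.log(p) for p in probs if p > 0)
    estimates = [estimator(rng.choices(outcomes, weights=probs, k=n))
                 for _ in range(trials)]
    mean = sum(estimates) / trials
    bias = mean - truth
    var = sum((e - mean) ** 2 for e in estimates) / trials
    mse = sum((e - truth) ** 2 for e in estimates) / trials
    return bias, var, mse


# Hypothetical setting: uniform distribution over 20 outcomes, 30 samples.
uniform = [1 / 20] * 20
bias, var, mse = bias_var_mse(plugin_entropy, uniform, n=30)
print(f"bias={bias:.3f}  var={var:.4f}  mse={mse:.4f}")
```

Because the variance here is computed with the population formula (dividing by the number of trials), the empirical MSE equals the squared empirical bias plus the empirical variance up to floating-point error.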

Plug-in Estimation of Entropy
A simple, two-step approach for estimating entropy is plug-in estimation. In the first step, we compute the maximum-likelihood estimate for p from our dataset D as follows:
$$\hat{p}_{\mathrm{MLE}}(x_k) = \frac{c(x_k)}{N}, \qquad c(x_k) = \sum_{n=1}^{N} \mathbb{1}\{x^{(n)} = x_k\} \tag{5}$$
In the second step, we plug Eq. (5) into Eq. (1) directly, which results in the estimator $\hat{H}_{\mathrm{MLE}}(D)$. So why is this a bad idea? While our probability estimates themselves are unbiased, entropy is a concave function. Consequently, by Jensen's inequality, this estimator is, in expectation, a lower bound on the true entropy (see App. E.1 for proof). Moreover, when $N \ll K$, which is often the case in power-law distributed data, the estimate becomes quite unreliable (Nemenman et al., 2002).
MM-Miller (1955) and Madow (1948). The first innovation in entropy estimation known to the authors is a simple fix derived from a first-order Taylor expansion of the plug-in estimator (described above). The Miller-Madow estimator only involves a simple additive correction, which is shown below:
$$\hat{H}_{\mathrm{MM}}(D) = \hat{H}_{\mathrm{MLE}}(D) + \frac{K - 1}{2N} \tag{6}$$
where K is the size of the support of X. The Miller-Madow correction should seem intuitive in that we add $\frac{K-1}{2N} \geq 0$ to compensate for the negative bias of the estimator. A full derivation of the Miller-Madow estimator is given in Proposition 2.
JACK-Zahl (1977). Next we consider the jackknife, which is a common strategy used to correct for the bias of statistical estimators. In the case of entropy estimation, we can apply the jackknife out of the box to correct the bias inherent in the MLE estimator. Explicitly, this is done by averaging plug-in entropy estimates $\hat{H}_{\mathrm{MLE}}(D)$, albeit with the n-th sample from the data removed; we denote this held-out plug-in estimator as $\hat{H}^{\setminus n}_{\mathrm{MLE}}(D)$. Averaging these "held-out" plug-in estimators results in the following simple entropy estimator:
$$\hat{H}_{\mathrm{J}}(D) = N \hat{H}_{\mathrm{MLE}}(D) - \frac{N - 1}{N} \sum_{n=1}^{N} \hat{H}^{\setminus n}_{\mathrm{MLE}}(D) \tag{7}$$
Note that the jackknife is applicable to any estimator, not just $\hat{H}_{\mathrm{MLE}}(D)$, and, thus, can be combined with any of the other approaches mentioned.
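The plug-in, Miller-Madow, and jackknife estimators described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for our experiments. The function names are hypothetical, and, as a stated assumption, `miller_madow_entropy` falls back to the number of observed classes when the true support size K is not supplied.

```python
import math
from collections import Counter


def entropy_from_counts(counts):
    """Entropy (in nats) of the empirical distribution given by `counts`."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def plugin_entropy(sample):
    """Plug-in estimate: entropy of the maximum-likelihood distribution."""
    return entropy_from_counts(Counter(sample).values())


def miller_madow_entropy(sample, support_size=None):
    """Plug-in estimate plus the additive correction (K - 1) / (2N).

    Assumption: if K is unknown, use the number of observed classes.
    """
    k = support_size if support_size is not None else len(set(sample))
    return plugin_entropy(sample) + (k - 1) / (2 * len(sample))


def jackknife_entropy(sample):
    """N * H_MLE minus (N - 1)/N times the sum of leave-one-out estimates."""
    n = len(sample)
    counts = Counter(sample)
    held_out_sum = 0.0
    for x in sample:
        counts[x] -= 1          # hold out one observation of class x
        held_out_sum += entropy_from_counts(counts.values())
        counts[x] += 1          # restore it
    return n * plugin_entropy(sample) - (n - 1) / n * held_out_sum


sample = list("aabc")
print(plugin_entropy(sample), miller_madow_entropy(sample), jackknife_entropy(sample))
```

For the toy sample above, both corrections raise the estimate relative to the plug-in value, in line with the plug-in estimator's negative bias.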
HT-Horvitz and Thompson (1952). Horvitz-Thompson is a general scheme for building estimators that employs importance weighting in order to more efficiently estimate a function of a random variable. Importantly, this estimator gives us the ability to compensate for situations where the probability of an outcome is so low that it is often not observed in a sample, which is often the case for e.g., power-law distributions.
While a full exposition of HT estimators is outside of the scope of this work, in essence, we can divide the expected probability of a class by each class's estimated inclusion probability to compensate for such situations. Given the true probability of an outcome $p(x_k)$, the probability that it occurs at least once in a sample of size N is $1 - (1 - p(x_k))^N$. The HT estimator for entropy is then defined as
$$\hat{H}_{\mathrm{HT}}(D) = -\sum_{k=1}^{K} \frac{\hat{p}_{\mathrm{MLE}}(x_k) \log \hat{p}_{\mathrm{MLE}}(x_k)}{1 - (1 - \hat{p}_{\mathrm{MLE}}(x_k))^N} \tag{8}$$
using our MLE probability estimates $\hat{p}_{\mathrm{MLE}}(x_k)$, where unobserved classes contribute zero to the sum.
CS-Chao and Shen (2003). Chao-Shen modifies HT by multiplying the MLE probability estimates by an estimate of sample coverage. Formally, let $f_1$ be the number of observed singletons in the sample; our sample coverage can be estimated as $\hat{C} = 1 - \frac{f_1}{N}$. The CS estimator is then computed as:
$$\hat{H}_{\mathrm{CS}}(D) = -\sum_{k=1}^{K} \frac{\hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k) \log\big(\hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k)\big)}{1 - (1 - \hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k))^N} \tag{9}$$
In the case that $f_1 = N$, we set $f_1 = N - 1$ to ensure the estimated entropy is not 0.
WW-Wolpert and Wolf (1995). One family of entropy estimators in information theory is based on Bayesian principles. The first of these was the Wolpert-Wolf estimator, which uses a Dirichlet prior (with concentration parameter $\boldsymbol{\alpha}$ and a uniform base distribution). This Bayesian estimator has a clean, closed form:
$$\hat{H}_{\mathrm{WW}}(D \mid \boldsymbol{\alpha}) = \psi(A + 1) - \sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{A} \psi(\tilde{\alpha}_k + 1) \tag{10}$$
where $\tilde{\alpha}_k = c(x_k) + \alpha_k$ (for the histogram count $c(x_k)$ of class k in the sample; this is analogous to Laplace smoothing), $A = \sum_{k=1}^{K} \tilde{\alpha}_k$, and ψ is the digamma function. A full derivation of Eq. (10) is given in Proposition 3. Unfortunately, Eq. (10) is very dependent on the choice of $\boldsymbol{\alpha}$: For large K, $\boldsymbol{\alpha}$ almost completely determines the final entropy estimate, an observation first made by Nemenman et al. (2002) which motivated their improved estimator described below.
Table 1: The best unigram entropy estimators on the corpora studied, by MAB and MSE, tested on various N ($10^2$ to $10^5$) and averaged over 100 samples. All differences are statistically significant on the permutation test; lighter color indicates fewer statistically significant comparisons on the Tukey test (scale: significantly better than 6, 5, 4, 3, 2, 1, or 0 other estimators).
NSB-Nemenman et al. (2002). Nemenman et al. (NSB) attempt to alleviate the Wolpert-Wolf estimator's dependence on $\boldsymbol{\alpha}$. They take $\boldsymbol{\alpha} = \alpha \cdot \mathbf{1}$, enforcing that the Dirichlet prior is symmetric, and develop a hyperprior over α that results in a near-uniform distribution over entropy. The hyperprior is given by
$$p_{\mathrm{NSB}}(\alpha) = \frac{K \psi_1(K\alpha + 1) - \psi_1(\alpha + 1)}{\log K} \tag{11}$$
where $\psi_1$ is the trigamma function. A full derivation of Eq. (11) is given in Proposition 4. This choice of hyperprior mitigates the effect that the chosen α has on the entropy estimate. Nemenman et al.'s (2002) entropy estimator is then the posterior mean of the Wolpert-Wolf estimator taken under $p_{\mathrm{NSB}}$:
$$\hat{H}_{\mathrm{NSB}}(D) = \int_0^{\infty} \hat{H}_{\mathrm{WW}}(D \mid \alpha \cdot \mathbf{1})\, p_{\mathrm{NSB}}(\alpha \mid D)\, d\alpha \tag{12}$$
where $p_{\mathrm{NSB}}(\alpha \mid D) \propto p(D \mid \alpha)\, p_{\mathrm{NSB}}(\alpha)$. Typically, numerical integration is used to quickly compute the unidimensional integral.
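The Horvitz-Thompson and Chao-Shen estimators described in this section can be sketched as follows. This is a hedged illustration using only the Python standard library; the function names are hypothetical, and classes with zero count are simply skipped, which is our reading of how unobserved classes enter the sums.

```python
import math
from collections import Counter


def horvitz_thompson_entropy(sample):
    """Plug-in terms reweighted by the estimated inclusion probability."""
    n = len(sample)
    total = 0.0
    for c in Counter(sample).values():
        p = c / n
        inclusion = 1.0 - (1.0 - p) ** n   # P(class appears at least once)
        total -= p * math.log(p) / inclusion
    return total


def chao_shen_entropy(sample):
    """Horvitz-Thompson with probabilities rescaled by estimated coverage."""
    n = len(sample)
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:          # avoid a coverage estimate of exactly zero
        singletons = n - 1
    coverage = 1.0 - singletons / n
    total = 0.0
    for c in counts.values():
        p = coverage * c / n
        inclusion = 1.0 - (1.0 - p) ** n
        total -= p * math.log(p) / inclusion
    return total


print(horvitz_thompson_entropy(list("aabc")), chao_shen_entropy(list("aabc")))
```

When a sample contains no singletons, the estimated coverage is 1 and the two functions return the same value.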

Experiments
Here we provide an evaluation of the entropy estimators presented in §3.2 on linguistic data.

Entropy of the Unigram Distribution
We start our study with a controlled experiment where we estimate the entropy of the truncated unigram distribution, the (finite) distribution over the frequent word tokens in a language without regard to context (Baayen et al., 2016; Diessel, 2017; Divjak, 2019; Nikkarinen et al., 2021). We renormalize the frequency counts of corpora in English, German, and Dutch (taken from CELEX; Baayen et al., 1995), as well as Mongolian and Tagalog (from Wikipedia). We take this renormalization as a gold standard distribution, since we cannot access the underlying unigram distribution. We then draw samples of varying sizes ($N \in \{10^2, 10^3, 10^4, 10^5\}$) from the distribution of renormalized frequency counts to test the estimators' ability to recover the underlying distributions' entropy. While the renormalized frequency counts are not necessarily representative of the true unigram distribution, they nevertheless provide us with a controlled setting to benchmark various entropy estimators.
We evaluate the estimators on both bias and MSE, as defined in Eqs. (2) and (4), as well as mean absolute bias (MAB). To test the statistical significance of differences in metrics between entropy estimators, we use paired permutation tests (Good, 2000) (sampling 1,000 permutations) between pairs of estimators, checking MAB and MSE. We run Tukey's test (1949) to judge the statistical significance of differences in MAB and MSE between all pairs of estimators, which found only a few insignificant comparisons when N was large.
Results are shown in Table 1 and Figure 1. We find that NSB (followed closely by CS) converges almost to the true entropy from below with only a few samples. HT is the best estimator for N < 2,000, but as N increases it tends to overestimate entropy to the point where its bias is greater than that of MLE. Besides HT, all estimators at all tested sample sizes N have lower MAB and MSE than MLE.
Table 2: Normalized mutual information, calculated with several estimators, between adjectives and the inanimate nouns they modify based on UD corpora. Colored-in cell means statistically significant NMI value.
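A paired permutation test of the kind mentioned above can be sketched as follows. This is one common sign-flipping variant, written under our own assumptions: the function name, the two-sided mean-difference statistic, and the add-one p-value correction are illustrative choices and are not claimed to match the exact procedure used in our experiments.

```python
import random


def paired_permutation_test(errors_a, errors_b, n_permutations=1000, seed=0):
    """Two-sided paired permutation test on the mean difference in errors.

    Under the null hypothesis the two estimators are exchangeable, so the
    sign of each paired difference can be flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical absolute errors of two estimators on the same 20 samples.
errors_mle = [0.5 + 0.01 * i for i in range(20)]
errors_other = [0.1] * 20
print(paired_permutation_test(errors_mle, errors_other))
```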

Replication of Williams et al. (2021)
Next, we turn to a replication of Williams et al.'s (2021) information-theoretic study on the association between gendered inanimate nouns and their modifying adjectives. They estimate mutual information by using its familiar decomposition as the difference of two entropies:
$$\mathrm{MI}(X; Y) = H(X) - H(X \mid Y)$$
The entropies H(X) and H(X | Y) are estimated independently and then their difference is computed. We replicate Williams et al.'s (2021) experiments using gold-parsed Universal Dependencies corpora, filtering out animate nouns with Multilingual WordNet (Bond and Foster, 2013). We rerun their experimental set-up using our full suite of entropy estimators to determine whether the relationship they posit remains significant, checking 3 more languages not in the original study. We report results for normalized mutual information (dividing MI by maximum possible MI) in Table 2. We find that using NSB (the estimator we found most effective in §4.1) instead of MLE nearly halves the measured effect in all languages. However, the effect remains statistically significant in 5 of 7 languages tested, including the 4 that were also in the original study.
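The decomposition of mutual information into a difference of two entropies can be illustrated with a small Python sketch in which the entropy estimator is a swappable argument. Several details are assumptions on our part: the function names are hypothetical, H(X | Y) is computed as a frequency-weighted average of per-group entropy estimates, and the "maximum possible MI" is taken to be min(H(X), H(Y)).

```python
import math
from collections import Counter


def plugin_entropy(items):
    """Plug-in entropy estimate in nats."""
    n = len(items)
    return -sum((c / n) * math.log(c / n) for c in Counter(items).values())


def conditional_entropy(xs, ys, estimator=plugin_entropy):
    """H(X | Y) as the relative-frequency-weighted average of H(X | Y = y)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * estimator(g) for g in groups.values())


def mutual_information(xs, ys, estimator=plugin_entropy):
    """MI(X; Y) = H(X) - H(X | Y), each term estimated separately."""
    return estimator(xs) - conditional_entropy(xs, ys, estimator)


def normalized_mi(xs, ys, estimator=plugin_entropy):
    """MI divided by min(H(X), H(Y)); this normalizer is an assumption."""
    bound = min(estimator(xs), estimator(ys))
    return mutual_information(xs, ys, estimator) / bound if bound > 0 else 0.0


# Toy data: adjectives paired with the gender of the noun they modify.
adjectives = ["red", "red", "big", "big"]
genders = ["f", "f", "m", "m"]
print(normalized_mi(adjectives, genders))
```

Passing a different estimator (for example, a Chao-Shen or NSB implementation) through the `estimator` argument is how one would compare estimators within the same pipeline.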

Replication of McCarthy et al. (2020)
Finally, we turn our attention to McCarthy et al.'s (2020) study on the similarity between grammatical gender partitions between languages. Using information-theoretic measures, they found that closely related languages have more similar gender groupings of core lexical items. We replicate their experiment on Swadesh lists (Swadesh, 1955) for 10 European languages with different estimators, and find that hierarchical clustering over both mutual information (MI) and variation of information (VI) produces the same trees as the original study. In this case, using NSB, our recommended estimator, results in a reduced estimate of MI (e.g., Croatian-Slovak: 0.54 with MLE → 0.46 with NSB), but significance testing with 1,000 permutations finds the same pairs were statistically significant for both MI and VI regardless of estimator: all pairs of Slavic languages and Romance languages, and Bulgarian-Spanish (see Figure 2). Thus, we see a similar result here as in the previous replication.

Conclusion
This work presents the first empirical study comparing the performance of various entropy estimators for use with natural language distributions. From experiments on synthetic data (appendix) and natural data (CELEX), and two replication studies of recent papers in information-theoretic linguistics, we find that the oft-employed plug-in estimator of entropy can cause misleading results, e.g., the overestimates of effect sizes seen in both replication studies. The recommendation of our paper is that researchers should carefully consider their choice of entropy estimator based on data availability and the nature of the underlying distribution.

Ethics Statement
The authors foresee no ethical concerns with the research presented in this paper.
R. Harald Baayen, Petar Milin, and Michael Ramscar. 2016. Frequency in lexical processing. Aphasiology, 30(11).
Table 3: Estimators with least MAB (mean absolute bias) and MSE (mean squared error) for various combinations of N and K sampling from symmetric Dirichlet. The lighter the color the fewer estimators the best estimator was found to be statistically significantly better than.

A Implementation
The code for each of the entropy estimators is implemented in Python using numpy (Harris et al., 2020), except for NSB which was taken from an existing efficient implementation in the ndd module (Marsili, 2016). We calculated entropies with base e (in nats).

B Experiments with simulated data
In our experiments with simulated data, we explore distributions sampled from a symmetric Dirichlet prior with varying number of classes K and known distributions of Zipfian form with various parameters. Words in natural languages have a roughly Zipfian distribution, with probability inversely proportional to rank (Zipf, 1935), and a symmetric Dirichlet distribution is analogous to e.g. POS tag label distributions in natural language. Thus, studying synthetic data from such distributions as a start is useful.

B.1 Experiment 1: Symmetric Dirichlet distributions
We sample 1,000 distributions from a symmetric Dirichlet distribution with variable number of classes K, i.e., with parameter $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K] = [1, \ldots, 1]$. We calculate entropy estimates on different sample sizes N. Since we know the parameters of the true distribution, we can compare estimates with the true entropy. We do pairwise comparisons of the MAB and MSE of estimators, using paired permutation tests to establish significance. Table 3 shows our results, including significance tests. It is clear that when $N \gg K$, all of the estimators have nearly converged to the true value and estimator choice does not matter. However, in the low-sample regime some estimators are indeed significantly better at approximating the true entropy. Our results are mixed as to which estimator is best in what context; the one found to be most frequently significantly better than other estimators was Chao-Shen. What is clear is that MLE is never the best choice.
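A scaled-down version of this experiment can be sketched as below. The settings (K = 50, N = 50, 200 sampled distributions) and function names are illustrative assumptions, and MAB is taken here to mean the average absolute error across sampled distributions, which is our reading of the metric rather than a definition stated in the text. Only the plug-in and Miller-Madow estimators are compared, with Miller-Madow using the number of observed classes in place of K.

```python
import math
import random
from collections import Counter


def sample_symmetric_dirichlet(k, alpha, rng):
    """Draw a distribution from Dirichlet(alpha, ..., alpha) via gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def true_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def plugin_entropy(sample):
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())


def miller_madow_entropy(sample):
    # Assumption: the support size is replaced by the number of observed classes.
    return plugin_entropy(sample) + (len(set(sample)) - 1) / (2 * len(sample))


def mean_absolute_bias(estimator, k, n, n_distributions=200, seed=0):
    """Average |estimate - truth| over distributions drawn from Dirichlet(1)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_distributions):
        probs = sample_symmetric_dirichlet(k, 1.0, rng)
        sample = rng.choices(range(k), weights=probs, k=n)
        errors.append(abs(estimator(sample) - true_entropy(probs)))
    return sum(errors) / len(errors)


# The shared seed means both estimators see identical samples (paired design).
mab_mle = mean_absolute_bias(plugin_entropy, k=50, n=50)
mab_mm = mean_absolute_bias(miller_madow_entropy, k=50, n=50)
print(f"MAB plug-in: {mab_mle:.3f}  MAB Miller-Madow: {mab_mm:.3f}")
```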

B.2 Experiment 2: Zipfian distributions
We sample 1,000 finite Zipfian distributions with K classes which obey Zipf's law, that the probability of an outcome is inversely proportional to its rank. The experimental setup is the same as in Experiment 1. A Zipfian distribution approximates (but is not a perfect model of) the distribution of tokens in natural language text in some languages, including English, which was the basis for the law being proposed. Compare similar experiments on infinite Zipf distributions by Zhang (2012). Results are in Table 4.
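As a rough illustration of the undersampled Zipfian setting, the sketch below builds one finite Zipfian distribution and compares the average signed error of the plug-in and Chao-Shen estimators. It is not our experimental code: the single distribution with K = 1000 and exponent 1, the sample size N = 200, the 100 trials, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def zipf_probs(k, exponent=1.0):
    """Finite Zipfian distribution: p(rank) proportional to rank ** -exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def plugin_entropy(sample):
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())


def chao_shen_entropy(sample):
    n = len(sample)
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    if singletons == n:
        singletons = n - 1
    coverage = 1.0 - singletons / n
    total = 0.0
    for c in counts.values():
        p = coverage * c / n
        total -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return total


def mean_bias(estimator, probs, n, trials=100, seed=0):
    """Average of (estimate - true entropy) over repeated samples of size n."""
    rng = random.Random(seed)
    truth = -sum(p * math.log(p) for p in probs)
    outcomes = range(len(probs))
    estimates = [estimator(rng.choices(outcomes, weights=probs, k=n))
                 for _ in range(trials)]
    return sum(estimates) / trials - truth


probs = zipf_probs(1000)
bias_mle = mean_bias(plugin_entropy, probs, n=200)
bias_cs = mean_bias(chao_shen_entropy, probs, n=200)
print(f"bias plug-in: {bias_mle:.3f}  bias Chao-Shen: {bias_cs:.3f}")
```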

C Replication of Williams et al. (2021)
We used the following UD treebanks:
Figure 4: The heatmaps display the p-values calculated between pairs of estimators for mean absolute bias (MAB) and mean squared error (MSE) for Experiment 1 (panel title: Pairwise MSE p-values, Dirichlet). More purple values mean the estimator on the y-axis (Estimator 2) is better than the estimator on the x-axis (Estimator 1). Comparisons tend to become non-significant as N increases, since all the estimators gradually converge to the true entropy.

E Derivation of the Entropy Estimators
Let $\mathcal{X} = \{x_k\}_{k=1}^{K}$ be a finite set. Let p be a distribution over $\mathcal{X}$. The entropy of p is defined as
$$H(p) = -\sum_{k=1}^{K} p(x_k) \log p(x_k)$$
Given a dataset D of N samples drawn i.i.d. from p, our goal is to estimate the entropy H(p). We will denote the count of an item $x_k$ as
$$c(x_k) = \sum_{n=1}^{N} \mathbb{1}\{x^{(n)} = x_k\}$$
The plug-in estimate of H(p) is defined to be the estimate of H(p) obtained by plugging the MLE estimate $\hat{p}_{\mathrm{MLE}}(x_k) = c(x_k)/N$ directly into the definition of entropy, i.e.,
$$H(\hat{p}_{\mathrm{MLE}}) = -\sum_{k=1}^{K} \frac{c(x_k)}{N} \log \frac{c(x_k)}{N} \tag{14}$$
This section discusses the problems with Eq. (14) as an estimator and provides detailed derivations of improved estimators found in the literature.

E.1 The Plug-in Estimator is Negatively Biased
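Proposition 1. The MLE entropy estimator in expectation underestimates true entropy, i.e.,
$$\mathbb{E}_p[H(\hat{p}_{\mathrm{MLE}})] \leq H(p)$$
Proof. The result is a simple consequence of Jensen's inequality and some basic manipulations:
$$\mathbb{E}_p[H(\hat{p}_{\mathrm{MLE}})] \leq H(\mathbb{E}_p[\hat{p}_{\mathrm{MLE}}]) = H(p)$$
where the inequality holds because H is concave and the equality holds because $\hat{p}_{\mathrm{MLE}}$ is an unbiased estimate of p. This completes the proof.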

E.2 Miller-Madow
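Proposition 2. Let p be a categorical distribution over $\mathcal{X} = \{x_1, \ldots, x_K\}$, i.e., a categorical distribution with support of size K. Let D be our dataset of size N sampled from p. Finally, let $\hat{p}_{\mathrm{MLE}}$ be the maximum-likelihood estimate computed on D. Then, we have
$$\mathrm{bias}(H(\hat{p}_{\mathrm{MLE}})) = -\frac{K - 1}{2N} + o(N^{-1})$$
Proof. We start by taking a first-order Taylor expansion (Lemma 1) and take an expectation of both sides.
This gives us:
$$\mathbb{E}_p[H(\hat{p}_{\mathrm{MLE}})] = \mathbb{E}_p\Big[-\sum_{k=1}^{K} \hat{p}_{\mathrm{MLE}}(x_k) \log p(x_k)\Big] - \mathbb{E}_p[\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)] = H(p) - \mathbb{E}_p[\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)]$$
where the last step uses $\mathbb{E}_p[\hat{p}_{\mathrm{MLE}}(x_k)] = p(x_k)$. Thus, we may compactly write the bias as:
$$\mathrm{bias}(H(\hat{p}_{\mathrm{MLE}})) = -\mathbb{E}_p[\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)]$$
Now, we find a simpler expression for the remainder $\mathbb{E}_p[\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)]$. Again, we start with a second-order Taylor expansion (Lemma 2):
$$\mathbb{E}_p[\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)] = \mathbb{E}_p\Big[\sum_{k=1}^{K} \frac{\Delta(x_k)^2}{2 p(x_k)}\Big] + \mathbb{E}_p\big[o(\Delta(x)^2)\big]$$
where $\Delta(x_k) = \hat{p}_{\mathrm{MLE}}(x_k) - p(x_k) = \frac{c(x_k)}{N} - p(x_k)$ and $c(x_k)$ is the count of $x_k$ in the training set. We now simplify the first term, using $\mathrm{var}(c(x_k)) = N p(x_k)(1 - p(x_k))$:
$$\mathbb{E}_p\Big[\sum_{k=1}^{K} \frac{\Delta(x_k)^2}{2 p(x_k)}\Big] = \sum_{k=1}^{K} \frac{N p(x_k)(1 - p(x_k))}{2 N^2 p(x_k)} = \sum_{k=1}^{K} \frac{1 - p(x_k)}{2N} = \frac{K - 1}{2N}$$
Next, we simplify the second term, $o(\Delta(x)^2)$, in the MLE case: since $\mathbb{E}_p[\Delta(x_k)^2] = p(x_k)(1 - p(x_k))/N$, we have $\mathbb{E}_p[o(\Delta(x)^2)] = o(N^{-1})$. Putting it all together, we get that $\mathrm{bias}(H(\hat{p}_{\mathrm{MLE}})) = -\frac{K-1}{2N} + o(N^{-1})$, which is the desired result.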
Interestingly, it can be seen that the negative bias of the MLE gets worse as the number of classes K grows. Distributions with large K pop up frequently when dealing with natural language.
Corollary 1. The plug-in estimator of entropy is consistent.
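Proof. From Proposition 2, we have $\mathrm{bias}(H(\hat{p}_{\mathrm{MLE}})) = -\frac{K-1}{2N} + o(N^{-1})$. Clearly, as $N \to \infty$, we have $\mathrm{bias}(H(\hat{p}_{\mathrm{MLE}})) \to 0$, so the estimator is consistent. One could also prove consistency through a simple application of the continuous mapping theorem.
Estimator 1 (Miller-Madow). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Then, the Miller-Madow estimator of H(p) is given by
$$\hat{H}_{\mathrm{MM}}(D) = H(\hat{p}_{\mathrm{MLE}}) + \frac{K - 1}{2N}$$
The Miller-Madow estimator is biased; however, it is consistent.
Lemma 1. The first-order Taylor approximation of $\hat{H}_{\mathrm{MLE}}(D) = H(\hat{p}_{\mathrm{MLE}})$ around the distribution p is given by
$$H(\hat{p}_{\mathrm{MLE}}) = -\sum_{k=1}^{K} \hat{p}_{\mathrm{MLE}}(x_k) \log p(x_k) + R$$
where the remainder R is given by
$$R = -\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)$$
Proof. The result follows from direct computation. We start by taking the Taylor expansion of $H(\hat{p}_{\mathrm{MLE}})$ around H(p):
$$H(\hat{p}_{\mathrm{MLE}}) = H(p) + \sum_{k=1}^{K} \frac{\partial H(p)}{\partial p(x_k)} \big(\hat{p}_{\mathrm{MLE}}(x_k) - p(x_k)\big) + R$$
Our first order term can then be rewritten as follows, using $\frac{\partial H(p)}{\partial p(x_k)} = -\log p(x_k) - 1$ and the fact that both distributions sum to one:
$$\sum_{k=1}^{K} \big(-\log p(x_k) - 1\big)\big(\hat{p}_{\mathrm{MLE}}(x_k) - p(x_k)\big) = -\sum_{k=1}^{K} \big(\hat{p}_{\mathrm{MLE}}(x_k) - p(x_k)\big) \log p(x_k)$$
Plugging this back into our Taylor expansion, we get the following:
$$H(\hat{p}_{\mathrm{MLE}}) = H(p) - \sum_{k=1}^{K} \big(\hat{p}_{\mathrm{MLE}}(x_k) - p(x_k)\big) \log p(x_k) + R = -\sum_{k=1}^{K} \hat{p}_{\mathrm{MLE}}(x_k) \log p(x_k) + R$$
Now, we see that this implies
$$R = H(\hat{p}_{\mathrm{MLE}}) + \sum_{k=1}^{K} \hat{p}_{\mathrm{MLE}}(x_k) \log p(x_k) = -\sum_{k=1}^{K} \hat{p}_{\mathrm{MLE}}(x_k) \log \frac{\hat{p}_{\mathrm{MLE}}(x_k)}{p(x_k)} = -\mathrm{KL}(\hat{p}_{\mathrm{MLE}} \,\|\, p)$$
which is the desired result.
Lemma 2. Define $\Delta(x) = p(x) - q(x)$. The second-order Taylor expansion of $\mathrm{KL}(p \,\|\, q)$ in $\Delta(x)$ is given by
$$\mathrm{KL}(p \,\|\, q) = \sum_{x} \frac{\Delta(x)^2}{2 q(x)} + o(\Delta(x)^2)$$
Proof. Now we compute the series expansion of the KL-divergence. We first make a tricky substitution:
$$p(x) = q(x) + \Delta(x)$$
Now, we proceed with the derivation:
$$\mathrm{KL}(p \,\|\, q) = \sum_{x} \big(q(x) + \Delta(x)\big) \log\Big(1 + \frac{\Delta(x)}{q(x)}\Big) = \sum_{x} \big(q(x) + \Delta(x)\big)\Big(\frac{\Delta(x)}{q(x)} - \frac{\Delta(x)^2}{2 q(x)^2}\Big) + o(\Delta(x)^2) = \sum_{x} \Delta(x) + \sum_{x} \frac{\Delta(x)^2}{2 q(x)} + o(\Delta(x)^2)$$
Since $\sum_x \Delta(x) = 0$, the first sum vanishes, which is the desired result.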

E.3 Jackknife
The jackknife resampling method is used to estimate the bias of an estimator and correct for it, by forming all subsamples of size N − 1 from the available sample of size N and averaging the statistic being estimated over them. Generally, this reduces the order of the bias of an estimator from $O(N^{-1})$ to at most $O(N^{-2})$ (Friedl and Stampfer, 2002).
Estimator 2 (Jackknife). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Let $\hat{H}^{\setminus n}(D)$ be an estimate of the entropy from a sample with the n-th observation held out. Then, the Jackknife estimator is given by
$$\hat{H}_{\mathrm{J}}(D) = N \hat{H}(D) - \frac{N - 1}{N} \sum_{n=1}^{N} \hat{H}^{\setminus n}(D)$$
This estimator is derived from the jackknife-resampled estimate of the bias of the MLE estimator, multiplied by N − 1.

E.4 Horvitz-Thompson
The Horvitz-Thompson estimator (HT; Horvitz and Thompson, 1952) is a common estimator given a finite universe, which is our case as K is finite. We omit a full derivation here as it is well documented in other places (Vieira, 2017). However, we note that, in contrast to many applications of HT, the application of HT to entropy estimation results in a biased estimator as the function whose mean we seek to estimate is $\log p(x_k)$, which is dependent on the unknown distribution p.
Estimator 3 (Horvitz-Thompson). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Then the Horvitz-Thompson estimator is defined as
$$\hat{H}_{\mathrm{HT}}(D) = -\sum_{k=1}^{K} \frac{\hat{p}_{\mathrm{MLE}}(x_k) \log \hat{p}_{\mathrm{MLE}}(x_k)}{1 - (1 - \hat{p}_{\mathrm{MLE}}(x_k))^N}$$
where $1 - (1 - \hat{p}_{\mathrm{MLE}}(x_k))^N$ is an estimate of the inclusion probability, i.e., the probability that $x_k$ appears in a random sample D of size N.
We do not know of a simple expression for the bias of the Horvitz-Thompson entropy estimator, but one observation is that $\mathbb{E}_p[(1 - \hat{p}_{\mathrm{MLE}}(x_k))^N] > (1 - p(x_k))^N$ when N > 1 (justified by Jensen's inequality, since $x^N$, N > 1, is convex over [0, 1]); that is, the probability of exclusion is overestimated in expectation, so the estimated inclusion probability underestimates the true inclusion probability.

E.5 Chao-Shen
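The Chao-Shen estimator builds upon Horvitz-Thompson by noting that the latter does not correct for the underestimation of the number of classes K and the resulting effect on estimates of $p(x_k)$; i.e., $1 - (1 - \hat{p}_{\mathrm{MLE}}(x_k))^N$ is always 0 for a class not included in the sample even if the class is present in the true distribution. We can reweight the sample probabilities to compensate for missing classes using the notion of sample coverage.
Definition 1 (Sample coverage). We define the sample coverage as
$$C = \sum_{k=1}^{K} p(x_k)\, \mathbb{1}\{x_k \in D\} \tag{88}$$
Definitionally, (1 − C) is then the probability of sampling an $x_k$ not observed in the sample D.
However, exact computation of Eq. (88) is impossible as we do not know the true distribution p. Thus, Chao and Shen (2003) fall back on a well-known estimator of C that uses a technique from Good-Turing smoothing (Good, 1953). Let $f_1$ be the number of classes with only one observation in the current sample, i.e., the number of singletons; then we can estimate the sample coverage as
$$\hat{C} = 1 - \frac{f_1}{N}$$
The Chao-Shen estimator, described below, simply re-scales the MLE estimate of probability $\hat{p}_{\mathrm{MLE}}(x_k)$ in the HT estimator by $\hat{C}$. This corrects for the observed underestimation of p's entropy by HT.
Estimator 4 (Chao-Shen). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Let $\hat{C} = 1 - f_1/N$ be the estimate of sample coverage defined above. The Chao-Shen estimator is then defined as
$$\hat{H}_{\mathrm{CS}}(D) = -\sum_{k=1}^{K} \frac{\hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k) \log\big(\hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k)\big)}{1 - (1 - \hat{C}\, \hat{p}_{\mathrm{MLE}}(x_k))^N}$$

E.6 Wolpert-Wolf
Fact 2 (Normalizer of a Dirichlet). The normalizer of a Dirichlet distribution with parameter $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_K]$ is
$$\int \prod_{k=1}^{K} p(x_k)^{\alpha_k - 1}\, dp = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}$$
A relatively easy proof of this fact makes use of a Laplace transform.
Estimator 5 (Wolpert-Wolf). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Then, the Wolpert-Wolf estimator is given by
$$\hat{H}_{\mathrm{WW}}(D \mid \boldsymbol{\alpha}) = \psi(A + 1) - \sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{A} \psi(\tilde{\alpha}_k + 1)$$
where $\tilde{\alpha}_k = c(x_k) + \alpha_k$ and $A = \sum_{k=1}^{K} \tilde{\alpha}_k$.
Proposition 3. The expectation of entropy under a Dirichlet posterior $\mathrm{Dirichlet}(\tilde{\boldsymbol{\alpha}})$, where the parameter $\tilde{\boldsymbol{\alpha}}$ has entries $\tilde{\alpha}_k = c(x_k) + \alpha_k$, is given by
$$\mathbb{E}[H(p)] = \psi(A + 1) - \sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{A} \psi(\tilde{\alpha}_k + 1)$$
where $A = \sum_{k=1}^{K} \tilde{\alpha}_k$.
Proof. Let $\mathrm{Dirichlet}(\tilde{\alpha}_1, \ldots, \tilde{\alpha}_K)$ be a Dirichlet posterior. The result follows by a series of manipulations:
$$\mathbb{E}[H(p)] = -\sum_{k=1}^{K} \mathbb{E}[p(x_k) \log p(x_k)] = -\sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{A} \big(\psi(\tilde{\alpha}_k + 1) - \psi(A + 1)\big) = \psi(A + 1) - \sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{A} \psi(\tilde{\alpha}_k + 1) \tag{113}$$
where the second equality uses $\mathbb{E}[p(x_k) \log p(x_k)] = \frac{d}{ds} \mathbb{E}[p(x_k)^s] \big|_{s=1}$ together with Fact 2, which proves the result.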
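The closed form for the posterior expected entropy can be checked numerically. The sketch below compares it against a Monte Carlo average over draws from the Dirichlet posterior. Everything here is illustrative: the counts, the symmetric prior with α = 1, and the function names are assumptions, and the `digamma` helper is a small standard-library stand-in (recurrence plus asymptotic series) rather than a library routine.

```python
import math
import random


def digamma(x):
    """Digamma for x > 0 via psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252)))


def wolpert_wolf_entropy(counts, alpha=1.0):
    """Closed-form posterior mean of entropy under a symmetric Dirichlet prior.

    `counts` must list every class in the support, including zero counts.
    """
    post = [c + alpha for c in counts]
    a = sum(post)
    return digamma(a + 1) - sum(t / a * digamma(t + 1) for t in post)


def monte_carlo_posterior_entropy(counts, alpha=1.0, draws=20000, seed=0):
    """Average entropy of distributions sampled from the Dirichlet posterior."""
    rng = random.Random(seed)
    post = [c + alpha for c in counts]
    total = 0.0
    for _ in range(draws):
        g = [rng.gammavariate(t, 1.0) for t in post]
        s = sum(g)
        total -= sum((v / s) * math.log(v / s) for v in g if v > 0)
    return total / draws


counts = [3, 1, 0, 2]   # hypothetical histogram over K = 4 classes
print(wolpert_wolf_entropy(counts), monte_carlo_posterior_entropy(counts))
```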

E.7 Nemenman-Shafee-Bialek
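Estimator 6 (Nemenman-Shafee-Bialek). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let D be our dataset of size N sampled from p. Define the NSB density as
$$p_{\mathrm{NSB}}(\alpha) = \frac{K \psi_1(K\alpha + 1) - \psi_1(\alpha + 1)}{\log K} \tag{114}$$
where $\psi_1$ is the trigamma function. Then, the NSB estimator is given by
$$\hat{H}_{\mathrm{NSB}}(D) = \int_0^{\infty} \hat{H}_{\mathrm{WW}}(D \mid \alpha \cdot \mathbf{1})\, p_{\mathrm{NSB}}(\alpha \mid D)\, d\alpha \tag{115}$$
where $p_{\mathrm{NSB}}(\alpha \mid D) \propto p(D \mid \alpha)\, p_{\mathrm{NSB}}(\alpha)$. The integral in Eq. (115) is typically computed by numerical integration.
To derive the Nemenman-Shafee-Bialek (NSB) estimator, we start with the idea that we would like a prior over distributions such that the distribution over expected entropy is uniform. In other words, we are looking for a $p_{\mathrm{NSB}}$ such that for $\alpha \sim p_{\mathrm{NSB}}$, the values of $\mathbb{E}_p[H(p) \mid \alpha]$ are uniformly distributed over [0, log K]. This is a good idea since, a priori, we do not know the entropy of p and, in the absence of any insight, we should assume the entropy could be anywhere in the range [0, log K]. We make the above intuition formal with the following proposition.
Proposition 4. Let $g(\alpha) = \mathbb{E}_p[H(p) \mid \alpha]$ denote the expected entropy under a symmetric Dirichlet prior with concentration parameter α. If the expected entropy is uniformly distributed over [0, log K], then the induced density over α is the NSB density of Eq. (114).
Proof. First, we note that $\mathbb{E}_p[H(p) \mid \alpha]$ is a continuous, increasing function in α. We will not prove this formally, but it should make intuitive sense: α is a smoothing parameter and the more the distribution is smoothed, the more entropic it should be. From basic analysis, we know that a continuous, strictly increasing function has an inverse.
Let $\mathcal{H} = g(\alpha)$, so that $\alpha = g^{-1}(\mathcal{H})$, and suppose $\mathcal{H}$ is uniformly distributed over [0, log K] with density $1/\log K$. Note $\mathcal{H}$ is a random variable and unrelated to the functional H(·); the choice of letter intentionally reminds one that the variable represents the expected entropy under a random distribution. Now we apply the change-of-variables formula and manipulate:
$$p_{\mathrm{NSB}}(\alpha) = p(\mathcal{H}) \left|\frac{d\mathcal{H}}{d\alpha}\right| = \frac{1}{\log K} \frac{d}{d\alpha} \mathbb{E}_p[H(p) \mid \alpha] = \frac{K \psi_1(K\alpha + 1) - \psi_1(\alpha + 1)}{\log K} \quad \text{(Lemma 3)}$$
By construction, the prior $p_{\mathrm{NSB}}(\alpha)$ has the property that the expected entropy $\mathbb{E}_p[H(p) \mid \alpha]$, where $\alpha \sim p_{\mathrm{NSB}}(\cdot)$, is uniformly distributed over [0, log K], which we can see by reversing the above derivation. This proves the result.
Nemenman et al. (2002) interpreted Proposition 4 in the following manner: As the variance of $\mathbb{E}_p[H(p) \mid \alpha]$, which is treated as a random variable since α is random, approaches 0, the NSB estimator implies a uniform prior over the entropy.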
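The numerical integration behind the NSB estimator can be sketched with the standard library alone. This is a rough illustration under several assumptions and is not the `ndd` implementation used in our experiments: the integral over α is approximated by a weighted sum on a log-spaced grid from $10^{-4}$ to $10^{3}$, the likelihood $p(D \mid \alpha)$ is taken to be the symmetric Dirichlet-multinomial marginal, the `digamma` and `trigamma` helpers are simple series approximations, and all function names are hypothetical.

```python
import math


def digamma(x):
    """Digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252)))


def trigamma(x):
    """Trigamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + inv + 0.5 * inv2
            + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42)))


def ww_entropy(counts, alpha):
    """Posterior mean entropy under a symmetric Dirichlet(alpha) prior."""
    a = sum(counts) + alpha * len(counts)
    return digamma(a + 1) - sum((c + alpha) / a * digamma(c + alpha + 1)
                                for c in counts)


def log_nsb_prior(alpha, k):
    """Log of the NSB density over alpha; requires k >= 2."""
    return math.log((k * trigamma(k * alpha + 1) - trigamma(alpha + 1))
                    / math.log(k))


def log_likelihood(counts, alpha):
    """Log marginal likelihood of the counts (up to an alpha-free constant)."""
    k, n = len(counts), sum(counts)
    return (math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
            + sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts))


def nsb_entropy(counts, grid_size=400, log10_min=-4.0, log10_max=3.0):
    """Posterior-weighted average of ww_entropy over a log-spaced alpha grid.

    `counts` must list all K classes, including those with zero count.
    """
    k = len(counts)
    step = (log10_max - log10_min) / (grid_size - 1)
    alphas = [10 ** (log10_min + i * step) for i in range(grid_size)]
    # log(alpha) is the Jacobian term for integrating on a log-spaced grid.
    log_w = [log_likelihood(counts, a) + log_nsb_prior(a, k) + math.log(a)
             for a in alphas]
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    return (sum(w * ww_entropy(counts, a) for w, a in zip(weights, alphas))
            / sum(weights))


print(nsb_entropy([5, 3, 2, 0, 0, 0, 0, 0, 0, 0]))
```

Working with log-weights and subtracting the maximum before exponentiating avoids underflow when the likelihood is very small, which happens quickly as N grows.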

Jackknife
The jackknife resampling method is used to estimate the bias of an estimator and correct for it, by forming all subsamples of size N − 1 from the available sample of size N. Generally, this reduces the order of the bias of an estimator from $O(N^{-1})$ to at most $O(N^{-2})$ (Friedl and Stampfer, 2002).
Estimator 1 (Jackknife). Let p be a categorical over K categories. We seek to estimate the entropy H(p). Let $X = \{x_n\}_{n=1}^{N}$ be sampled $x_n \sim p$. Let $\hat{H}^{\setminus n}(X)$ be an estimate of the entropy from a sample with the n-th observation held out. Then, the Jackknife estimator is given by
$$\hat{H}_{\mathrm{J}}(X) = N \hat{H}_{\mathrm{MLE}}(X) - \frac{N - 1}{N} \sum_{n=1}^{N} \hat{H}^{\setminus n}_{\mathrm{MLE}}(X)$$
The estimator can be simplified to be a summation over classes, which will make further analyses tractable. The tricky part to simplify here is $\sum_{n=1}^{N} \hat{H}^{\setminus n}_{\mathrm{MLE}}(X)$. This is the sum of all estimated entropies over samples such that one of the observations is removed.
This means that for a class k with $c_k$ observations, in this sum there will be $c_k$ times that one of its class members is removed. In that case, its contribution to the whole sum will be $-c_k \frac{c_k - 1}{N - 1} \log \frac{c_k - 1}{N - 1}$. The remaining $N - c_k$ instances will not have a member of class k removed. Their total contribution to this sum will then be $-(N - c_k) \frac{c_k}{N - 1} \log \frac{c_k}{N - 1}$.
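The class-wise simplification of the held-out sum can be checked against the direct leave-one-out computation. The following sketch is illustrative only; the function names are hypothetical, the convention 0 log 0 = 0 is assumed, and the sample must contain at least two observations.

```python
import math
from collections import Counter


def xlogx(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def held_out_sum_naive(sample):
    """Sum of plug-in entropies over all N leave-one-out samples."""
    n = len(sample)
    counts = Counter(sample)
    total = 0.0
    for x in sample:
        counts[x] -= 1
        total -= sum(xlogx(c / (n - 1)) for c in counts.values())
        counts[x] += 1
    return total


def held_out_sum_by_class(sample):
    """Same sum, computed with one pass over classes instead of observations."""
    n = len(sample)
    total = 0.0
    for c in Counter(sample).values():
        total -= c * xlogx((c - 1) / (n - 1))        # a member of this class removed
        total -= (n - c) * xlogx(c / (n - 1))        # a member of another class removed
    return total


sample = list("mississippi")
print(held_out_sum_naive(sample), held_out_sum_by_class(sample))
```

The class-wise form costs time proportional to the number of observed classes rather than to N times that number, which matters for large corpora.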
Proposition 1. The Jackknife estimator is consistent.
Proof. We take the limit as N → ∞.
We will now use a Laurent series approximation for $\log \frac{a}{a-1} = \sum_{x=1}^{\infty} \frac{1}{x a^x}$, which will make it possible to simplify this expression asymptotically. We have to be careful here, however, since the series only converges to the value we are approximating when a > 1; i.e., a = 1 is a special case.