On Homophony and Rényi Entropy

Homophony’s widespread presence in natural languages is a controversial topic. Recent theories of language optimality have tried to justify its prevalence, despite its negative effects on cognitive processing time, e.g., Piantadosi et al. (2012) argued homophony enables the reuse of efficient wordforms and is thus beneficial for languages. This hypothesis has recently been challenged by Trott and Bergen (2020), who posit that good wordforms are more often homophonous simply because they are more phonotactically probable. In this paper, we join in on the debate. We first propose a new information-theoretic quantification of a language’s homophony: the sample Rényi entropy. Then, we use this quantification to revisit Trott and Bergen’s claims. While their point is theoretically sound, a specific methodological issue in their experiments raises doubts about their results. After addressing this issue, we find no clear pressure either towards or against homophony—a much more nuanced result than either Piantadosi et al.’s or Trott and Bergen’s findings.


Introduction
Ambiguity is a hallmark of human language, and is present at all levels of linguistic structure. Both the causes and resulting effects of ambiguity are topics which have sparked much debate. For example, while some claim it has a beneficial impact on communication, others have taken it as a sign of inefficiency. In this work, we contribute to the debate surrounding a specific form of lexical ambiguity in which a wordform shares multiple unrelated meanings: homophony.1 While the quantitative study of homophony dates back to (at least) Zipf (1949), recently, Trott and Bergen (2020) proposed a new explanation for the higher rate of homophony amongst good, i.e. short or phonotactically well-formed, wordforms: if words were sampled i.i.d. from a phonotactic distribution, then good wordforms would simply be sampled more often. This implies there is no pressure favouring homophony in these words for the sake of efficiency, which directly opposes Piantadosi et al.'s (2012) hypothesis. In fact, Trott and Bergen go further, relying on their experimental results to argue that "homophony may even be selected against in real languages."2 In this work, we join this debate by proposing a novel quantification of a language's homophony: the sample Rényi entropy,3 defined as the negative log-likelihood (or surprisal) that two instances in an M-sized sample take on the same value. When measured on an observed lexicon, this measure holistically captures the chance that two wordforms coincide, i.e., that they are homophones, providing a new means to test whether lexicons are under a pressure in favour of or against homophony.
Further, we revisit Trott and Bergen's claims. Whilst their theoretical arguments are sound, we believe their experimental design could not have provided concrete evidence for or against their hypothesis. Specifically, their inadequate modelling of the phonotactic distribution, through the use of weakly regularised n-grams, causes us to question conclusions drawn from their experiments. We take measures to address this issue, relying on more expressive LSTM language models, and provide our own analysis of homophony in natural language.
Experimentally, we arrive at more nuanced results than prior work, finding no pressure either towards or against homophony. We conclude with the warning that the biases in our models, and in those of other works, need to be carefully considered when relying on them to answer linguistic questions.
2 Their argument is actually more subtle than this. They posit, for instance, that this pressure may be indirect or that there might not actually be a pressure against the existence of homophony per se, but rather that their results could reflect a constraint on the extent to which any given wordform can be saturated with distinct, unrelated meanings.
3 The Rényi entropy (Rényi, 1961) is a generalisation of Shannon's entropy (Shannon, 1948).

Homophony
Homophony is a widespread phenomenon which has long puzzled linguists. On average, roughly 4% of the words in a language are estimated to be homophones (Dautriche, 2015). This rate, however, varies widely across languages; in English, for instance, Rodd et al. (2002) estimate it to be 7.4%. A number of works hint at the inefficiencies homophony leads to: Rodd et al. (2002) find that homophonous words are recognised more slowly; Mazzocco (1997) shows homophones are harder for children to learn.
Yet a large body of work has argued for the efficiency of homophony. Piantadosi et al. (2012) suggest ambiguity is a desirable property in that it increases a language's communicative efficiency. Ambiguity would allow a language to reuse good wordforms, which falls in line with Zipf's principle of least effort. With this in mind, Piantadosi et al. showed that short, frequent and phonotactically probable wordforms have more homophones than their counterparts. In further support of this hypothesis, subsequent work found children easily learn to differentiate homophones when pairs have distinct syntactic categories (and that homophony is more likely in these cases); Pimentel et al. (2020a) showed speakers make contexts more informative in the presence of lexical ambiguity. These results suggest people naturally navigate ambiguity.
However, Trott and Bergen recently proposed a new explanation of Piantadosi et al.'s findings: they attribute homophony to chance. Specifically, they claim that if we model the phonotactic distribution probabilistically, we can see that more homophones amongst good wordforms should be expected: they are simply more probable. Yet their methodology to support this claim, comparing natural lexicons with artificial ones sampled from n-gram language models, has an important drawback: their use of 5-gram models with only weak Laplace smoothing. These models are prone to overfitting. As such, it is not surprising that an artificially generated lexicon would contain more homophony than natural ones; as we will show, overfit distributions will likely produce more collisions. Thus, it is not entirely clear what we can conclude from their experiments.

Quantifying Homophony
As per its definition, homophony should be tightly related to a language's phonotactics, i.e. its distribution over wordforms. In this section, we first provide a definition of a language's phonotactic distribution. We then present both the Rényi collision entropy and the sample Rényi entropy as new measures of homophony.

Phonotactics and Wordforms
Formally, phonotactics defines a language's set of plausible wordforms. Its classic exemplification, provided by Chomsky and Halle (1965), is that while the unattested wordform blick would be plausible in English, *bnick would not. Under a probabilistic interpretation (Hayes and Wilson, 2008; Gorman, 2013), this can be re-stated as blick having high phonotactic probability, while *bnick has low phonotactic probability. Notably, a language's phonotactics highly constrains its set of possible wordforms (Dautriche et al., 2017a) and, cross-linguistically, the size of these sets seems to be roughly constant (Pimentel et al., 2020b). Further, phonotactics has a tight relationship with word frequency; more phonotactically likely words are more frequent (Mahowald et al., 2018).
We model the phonotactic distribution over possible wordforms as a language model:

p(w) = \prod_{t=1}^{|w|} p(w_t \mid w_{<t})    (1)

whose support is the infinite set W, defined here as the Kleene closure Σ* of a phonetic alphabet Σ, albeit where all w ∈ W are padded with beginning-of- and end-of-word symbols. Under this definition, highly plausible wordforms are assigned high probability, and vice versa.
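To make this concrete, here is a minimal sketch of such a model, with a toy conditional distribution; all phones and probabilities below are invented for illustration, and real models condition on longer histories than the single previous phone:

```python
import math

# Toy phonotactic model: p(next phone | previous phone).
# All probabilities here are invented for illustration.
BOW, EOW = "<w>", "</w>"
COND = {
    BOW: {"b": 0.6, "k": 0.4},
    "b": {"l": 0.5, "n": 0.1, "ɪ": 0.4},
    "l": {"ɪ": 1.0},
    "n": {"ɪ": 1.0},
    "ɪ": {"k": 1.0},
    "k": {EOW: 1.0},
}

def word_logprob(phones):
    """Log-probability (bits) of a wordform under the model, eq. (1):
    the product of per-phone conditionals, padded with <w> and </w>."""
    seq = [BOW] + list(phones) + [EOW]
    total = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        p = COND.get(prev, {}).get(nxt, 0.0)
        if p == 0.0:
            return float("-inf")  # phonotactically impossible under this model
        total += math.log2(p)
    return total

# Mirroring Chomsky and Halle's example: "blick" is assigned a higher
# probability than "bnick" under this toy distribution.
print(word_logprob(["b", "l", "ɪ", "k"]))  # higher (less negative)
print(word_logprob(["b", "n", "ɪ", "k"]))  # lower
```

The word probability is simply the product of the conditionals along the padded sequence, so implausible transitions drive it down multiplicatively.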

Entropy as a Measure of Homophony
The Rényi entropy is a generalisation of the better-known Shannon entropy. By its information-theoretic definition, a natural parallel can be drawn between the Rényi entropy and homophony. Its general form is defined as

H_\alpha(p) = \frac{1}{1 - \alpha} \log \sum_{w \in W} p(w)^\alpha    (2)

for α ≥ 0, α ≠ 1. If we take the limit α → 1, we recover the Shannon entropy:

H(p) = -\sum_{w \in W} p(w) \log p(w)    (3)

which captures the inherent uncertainty in a distribution, i.e. the larger its value, the less predictable the outcome. The case of α = 2 yields the collision entropy (sometimes just termed the Rényi entropy):

H_2(p) = -\log \sum_{w \in W} p(w)^2    (4)

which, in our setting, is the negative log-likelihood that two wordforms sampled i.i.d. from the same distribution are the same. It thus provides a natural quantification of homophony in a language where the words are distributed i.i.d.4 Although both the collision and Shannon entropies are measures of uncertainty, they capture distinct properties of the distribution. The Shannon entropy represents the expected surprisal of observing any specific wordform w, while the collision entropy computes the surprisal that a pair of words have identical forms, independent of which form.
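A small numeric check of these quantities, using a toy three-word distribution (the wordforms and probabilities are invented for illustration):

```python
import math

def shannon_entropy(p):
    """H(p) = -sum p log2 p, eq. (3)."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def renyi_entropy(p, alpha):
    """H_alpha(p) = 1/(1-alpha) * log2 sum p^alpha, eq. (2); alpha != 1."""
    return math.log2(sum(q ** alpha for q in p.values())) / (1 - alpha)

def collision_entropy(p):
    """H_2(p) = -log2 sum p^2, eq. (4): surprisal of an i.i.d. collision."""
    return -math.log2(sum(q ** 2 for q in p.values()))

dist = {"kat": 0.5, "dɔg": 0.25, "fɪʃ": 0.25}  # toy distribution
print(shannon_entropy(dist))    # 1.5 bits
print(collision_entropy(dist))  # -log2(0.375) ≈ 1.415 bits
```

Note that the collision entropy never exceeds the Shannon entropy, a property illustrated further in App. C.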

Measuring Collisions in a Lexicon
In the previous section, we presented the Rényi collision entropy as measured on a specific phonotactic distribution. The measure in eq. (4) has the unstated assumption that a pair of wordforms would be sampled i.i.d. from this distribution, i.e., that the probability of a collision is p(w)^2, as opposed to p(w^{(1)})\, p(w^{(2)} \mid w^{(1)}). We do not, however, know if this i.i.d. assumption is valid for naturally occurring lexica. We thus propose a new measure, termed the sample Rényi entropy, which does not inherently encode an i.i.d. assumption. Given an observed lexicon W = \{w^{(m)}\}_{m=1}^{M} of size M, we directly measure the surprisal of two randomly selected words being homophones as

R(W) = -\log \frac{1}{M(M-1)} \sum_{m=1}^{M} \sum_{m' \neq m} \mathbb{1}\{w^{(m)} = w^{(m')}\}    (5)

In words, the above equation estimates the likelihood of a collision as the number of observed homophones over the number of possible collisions. Notably, if words are sampled i.i.d., i.e., if there is no pressure in favour of or against homophony, the sample Rényi entropy goes to the actual Rényi entropy in eq. (4) as M → ∞. In other words, under the i.i.d. assumption, eq. (5) is a consistent estimator of eq. (4).

4 We note that the Rényi collision entropy measures a specific notion of homophony, one which is closely related to the average number of meanings per wordform. By selecting other values for α in the Rényi entropy, one can capture different properties of the phonotactic distribution. The Rényi min-entropy H_∞(p), for instance, is defined by a choice of α = ∞ in eq. (2) and measures the surprisal of the most probable wordform, being instead closely related to the maximum number of meanings per wordform.
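The sample Rényi entropy can be computed directly from wordform counts; a minimal sketch on an invented toy lexicon:

```python
import math
from collections import Counter

def sample_renyi_entropy(lexicon):
    """Sample Rényi entropy, eq. (5): surprisal that two distinct words
    drawn without replacement from the lexicon share a wordform."""
    m = len(lexicon)
    counts = Counter(lexicon)
    # Ordered pairs of distinct indices with identical wordforms.
    collisions = sum(c * (c - 1) for c in counts.values())
    return -math.log2(collisions / (m * (m - 1)))

# Toy lexicon: "bɛr" appears twice (e.g. bear/bare); the rest are unique.
lexicon = ["bɛr", "bɛr", "kat", "dɔg", "fɪʃ"]
print(sample_renyi_entropy(lexicon))  # -log2(2 / 20) ≈ 3.32 bits
```

With no homophones at all the collision count is zero and the estimate diverges, so in practice this is only computed on lexica that contain at least one homophonous pair.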

A Tractable Estimate of Rényi Entropy
Note that in our setting, it is impossible to exactly calculate the Rényi entropy H_2(p), given that the support of p (i.e., W) is infinite. In this work, we estimate H_2(p) over a subset W_δ ⊂ W:

\hat{H}_2(p) = -\log \sum_{w \in W_\delta} p(w)^2    (6)

Fortunately, we can show a tight bound on the approximation when the finite W_δ is chosen wisely.
Theorem 3.1. Let W_δ be the set of all wordforms with a probability of at least δ, i.e.

W_\delta = \{w \in W \mid p(w) \ge \delta\}

We can bound our estimation error as:5

H_2(p) \le \hat{H}_2(p) \le H_2(p) + \log\left(1 + \frac{(1 - \xi)\,\delta}{\eta}\right)

where we can precisely compute both ξ and η, which are defined as \xi = \sum_{w \in W_\delta} p(w) and \eta = \sum_{w \in W_\delta} p(w)^2.
Proof. See App. D.
This theorem implies that \hat{H}_2(p) is an upper bound on the true value H_2(p), which can be made arbitrarily tight for small δ (we choose δ = 10^{-8} here).
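The truncated estimate and its bound can be verified numerically; the sketch below substitutes a toy geometric distribution for p, purely to make the infinite support tractable by hand:

```python
import math

# Toy stand-in for p(w): geometric probabilities p_k = 0.5^(k+1), k = 0, 1, ...
# (these sum to 1, and sum of squares is 1/3, so H_2 = log2(3) exactly).
def p(k):
    return 0.5 ** (k + 1)

delta = 1e-4
# W_delta: all outcomes with probability >= delta (finite, since p_k decays).
support = [k for k in range(100) if p(k) >= delta]
xi = sum(p(k) for k in support)         # captured probability mass
eta = sum(p(k) ** 2 for k in support)   # captured squared mass

h2_hat = -math.log2(eta)                # truncated estimate, eq. (6)
h2_true = math.log2(3)                  # exact H_2(p) for this toy p
bound = h2_true + math.log2(1 + (1 - xi) * delta / eta)

# Theorem 3.1: H_2(p) <= H2_hat <= H_2(p) + log(1 + (1 - xi) * delta / eta)
print(h2_true, h2_hat, bound)
```

For this δ the gap between the estimate and the true value is on the order of 1e-8 bits, illustrating how tight the bound becomes as δ shrinks.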

A Null Hypothesis Test
We now construct a null-hypothesis test to evaluate whether the observed lexicon is shaped by pressures in favour of or against homophony. Our null distribution over lexica of size M is defined as

p(W) = \prod_{m=1}^{M} p(w^{(m)})    (7)

where p(w) is a phonotactic distribution. We further define a second distribution over values of the sample Rényi entropy, i.e. p(R(W)), where W is distributed according to p(W). We can now ask whether the sample Rényi entropy of the observed lexicon, R(W_{obs}), is abnormal under the null distribution. This suggests the following null hypothesis test:

T_0: R(W_{obs}) is distributed according to p(R(W))

For a given p(W), we can now test this hypothesis by evaluating the probabilities

\Pr(R(W) \le R(W_{obs}))  and  \Pr(R(W) \ge R(W_{obs}))

which we can estimate using Monte Carlo sampling.
We reject the null hypothesis if either probability is smaller than 0.005, which yields a significance level of p < 0.01 under a two-tailed test. Strictly speaking, rejecting T_0 means that we have rejected that the sample Rényi entropy of the observed lexicon is plausibly consistent with the sample Rényi entropy of a lexicon sampled according to the null distribution p(W).
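The Monte Carlo test can be sketched as follows; the phonotactic model is stubbed out with an invented five-form toy distribution, and `null_test` is a hypothetical helper name, not the paper's implementation:

```python
import math
import random
from collections import Counter

random.seed(0)

# Toy phonotactic distribution p(w) standing in for a trained model.
FORMS = ["ka", "to", "mi", "so", "ne"]
PROBS = [0.4, 0.3, 0.15, 0.1, 0.05]

def sample_renyi_entropy(lexicon):
    m = len(lexicon)
    collisions = sum(c * (c - 1) for c in Counter(lexicon).values())
    return -math.log2(collisions / (m * (m - 1)))

def null_test(observed_lexicon, n_samples=2000):
    """Monte Carlo estimates of Pr(R(W) <= R(obs)) and Pr(R(W) >= R(obs))
    under the null: lexica of the same size sampled i.i.d. from p(w)."""
    m = len(observed_lexicon)
    r_obs = sample_renyi_entropy(observed_lexicon)
    draws = [sample_renyi_entropy(random.choices(FORMS, PROBS, k=m))
             for _ in range(n_samples)]
    p_low = sum(r <= r_obs for r in draws) / n_samples
    p_high = sum(r >= r_obs for r in draws) / n_samples
    return p_low, p_high

# Here the "observed" lexicon is itself drawn from the null distribution,
# so the test should typically not reject T_0.
observed = random.choices(FORMS, PROBS, k=40)
p_low, p_high = null_test(observed)
# Reject T_0 at p < 0.01 (two-tailed) if either probability is below 0.005.
print(p_low, p_high)
```

A small p_low would signal more homophony than the null predicts (a pressure in favour), while a small p_high would signal less (avoidance).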
We now analyse the assumptions we make by using p(W) and discuss what conclusions we may be able to draw despite those assumptions. The two important assumptions are as follows: (i) wordforms are sampled according to p(w); (ii) wordforms are sampled i.i.d.
Therefore, if we believe our phonotactic distribution is correct-i.e., assumption (i) is good-this null hypothesis directly tests whether wordforms are sampled i.i.d. Rejecting it, thus, gives us evidence that homophony is either favoured or hindered in a lexicon. Should we believe assumption (i), we find evidence in support of homophony avoidance if the observed lexicon's sample Rényi entropy is significantly larger than the artificial one's. On the other hand, we find evidence of a pressure in favour of homophony if the observed lexicon's sample Rényi entropy is smaller than its artificial counterpart. Assumption (i) is rather important, however, as we discuss further in §5.

Experimental Methodology
The sample Rényi entropy, presented in eq. (5), can be directly computed on an observed lexicon. On the other hand, both the Rényi entropy (as depicted in §3.2) and our null hypothesis test are computed over a phonotactic distribution, to which we do not have direct access. An important consideration, thus, is how exactly this distribution can be approximated. Recently, Trott and Bergen (2020) relied on weakly regularised n-gram models for their analysis. As can be seen in our earlier results (Pimentel et al., 2020b), neural language models can capture this phonotactic distribution much more faithfully.
In this work, we will compare Trott and Bergen's n-grams with Pimentel et al.'s LSTM models, and show how n-grams may give misleading results.
n-gram. Perhaps the simplest method for estimating distributions of phones in a language is through n-gram, or in this case n-phone, modelling. Specifically, we can estimate the probability of observing some phone w_t given the previous n − 1 phones by computing the proportion of times this phone follows those previous n − 1 phones in a corpus. By this definition, sequences not present in the corpus will be assigned 0 probability under the model. This, among other factors, contributes to the often poor generalisation abilities of basic n-gram models. Indeed, there exists an entire literature on smoothing and regularisation techniques for n-gram modelling (Katz, 1987; Ney et al., 1994; Chen and Goodman, 1996). Laplace smoothing is a popular choice, being used in a number of recent works in computational linguistics (e.g. Dautriche et al., 2017a; Trott and Bergen, 2020). However, it is perhaps the simplest of such regularisation techniques, and usually leads to much weaker empirical performance than, e.g., Kneser–Ney (Ney et al., 1994). It is therefore natural to question whether an n-gram model with simple Laplace smoothing can provide a good representation of the true phonotactic distribution of a language. In our experiments, we follow Trott and Bergen in using a 5-gram model with Laplace smoothing with strength 0.01 as p(w_t | w_{<t}).
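A minimal sketch of such a Laplace-smoothed estimator (a bigram rather than a 5-gram, to stay short; the corpus and alphabet are toy):

```python
from collections import Counter

def train_laplace_bigram(corpus, alphabet, k=0.01):
    """Conditional p(next | prev) with add-k (Laplace) smoothing.
    Unseen transitions get probability k / (count(prev) + k * |vocab|)."""
    pair_counts = Counter()
    ctx_counts = Counter()
    for word in corpus:
        seq = ["<w>"] + list(word) + ["</w>"]
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            ctx_counts[prev] += 1
    vocab = list(alphabet) + ["</w>"]
    def prob(nxt, prev):
        return (pair_counts[(prev, nxt)] + k) / (ctx_counts[prev] + k * len(vocab))
    return prob

corpus = ["ban", "bat", "nab", "tan"]  # toy training corpus
prob = train_laplace_bigram(corpus, alphabet="abnt")
# Seen transitions dominate; unseen ones get only the tiny smoothing mass,
# which is why weakly smoothed n-grams concentrate probability on the
# training set (the overfitting discussed above).
print(prob("a", "b"), prob("t", "b"))
```

With smoothing strength k = 0.01, an unseen transition here receives roughly three orders of magnitude less probability than a transition seen twice, illustrating how little mass leaks away from the training data.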

LSTM.
In the task of sentence-level language modelling, neural models have surpassed their n-gram counterparts with respect to standard evaluation metrics. Neural architectures similarly outperform an n-phone model on the task of representing the phonotactic distribution. We thus make use of a vanilla LSTM character-level language model to estimate this distribution, using a similar architecture to Pimentel et al.'s (2020b). In short, we first retrieve a lookup embedding z_t ∈ R^e for each phone w_t in a wordform. We then feed these into an LSTM (Hochreiter and Schmidhuber, 1997) to get hidden states h_t ∈ R^d. Finally, these hidden states are linearly transformed and passed through a softmax to arrive at a distribution p(w_t | w_{<t}) over the next phone. We train this model by minimising its cross-entropy with the distribution of the observed data. We use an LSTM architecture with 2 layers, an embedding size of 64, a hidden size of 256, and a dropout probability of 0.33. This model is implemented using PyTorch (Paszke et al., 2019) and optimised using Adam (Kingma and Ba, 2015).
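A sketch of this architecture in PyTorch, assuming the hyperparameters above (the class and variable names are ours, not the paper's, and the inventory size is hypothetical):

```python
import torch
import torch.nn as nn

class PhonotacticLSTM(nn.Module):
    """Phone-level LSTM language model; sizes follow the description above
    (2 layers, 64-dim embeddings, 256-dim hidden states, dropout 0.33)."""
    def __init__(self, n_phones, emb=64, hid=256, layers=2, dropout=0.33):
        super().__init__()
        self.embedding = nn.Embedding(n_phones, emb)
        self.lstm = nn.LSTM(emb, hid, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.out = nn.Linear(hid, n_phones)

    def forward(self, phone_ids):
        z = self.embedding(phone_ids)   # lookup embeddings z_t
        h, _ = self.lstm(z)             # hidden states h_t
        return self.out(h)              # logits over the next phone

n_phones = 30  # hypothetical inventory size, incl. word-boundary symbols
model = PhonotacticLSTM(n_phones)
logits = model(torch.randint(n_phones, (8, 5)))  # batch of 8 words, length 5
# Training minimises this cross-entropy (here on random targets) with Adam.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, n_phones),
                             torch.randint(n_phones, (8 * 5,)))
```

The softmax is folded into `CrossEntropyLoss`, which expects raw logits; at sampling time one would apply `softmax` to the logits directly.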

Model Selection.
We evaluate the quality of our models by measuring their cross-entropy on held-out data, as is common in language modelling. We report their train and test cross-entropies in Table 1. Note that minimising this cross-entropy is equivalent to minimising the Kullback–Leibler divergence between our estimated model and the actual phonotactic distribution. Thus, this serves as a metric of how well our model fits the data.
Data. We use CELEX (Baayen et al., 1995), a dataset which covers three languages (English, German, and Dutch), as the source of data for our experiments. We restrict our analysis to monomorphemic words,6 and note that we count words with multiple parts of speech as homophones (as both Piantadosi et al. and Trott and Bergen do). This may inflate the number of homophones in our actual lexicons, thus reducing their surprisal in our analysis. CELEX, however, marks zero-derivation forms; we thus do not use these words in our analysis. When computing the plug-in estimate of the lexicon's Rényi entropy (in eq. (5)) we use our entire dataset. We further use these wordforms to train our phonotactic models, splitting them into 80-10-10 train-validation-test sets. The test set is held out and only used for estimating the cross-entropy.

Results, Discussion and Conclusion 7
Table 1 displays our main results:8 first, we note that in terms of cross-entropy, the LSTM models provide better representations of the phonotactic distributions of all three languages. Second, the Shannon entropy of the LSTM is smaller than that of the n-gram. The n-gram, thus, appears to distribute probability mass more uniformly over the set W than the LSTM, while the LSTM concentrates more on the set of plausible wordforms.
The Rényi collision entropy results must be more carefully analysed. At first glance, we see that the n-gram model has the smallest Rényi entropy across all languages, having more than a 1-bit difference from the lexicon's sample Rényi entropy in both English and German. This may lead one to conclude that homophony is strongly disfavoured in all these languages. Nonetheless, the LSTM's collision entropy is considerably larger than the n-gram model's, while the LSTM has both a lower cross-entropy and a lower Shannon entropy. We posit this is due to the n-gram strongly overfitting the training set, giving these instances higher probability than they are due. These few overfit wordforms drive its Rényi entropy down, while the rest of the probability mass is spread over W and increases the n-gram's Shannon entropy.9 In other words, n-gram models do not approximate p(w) well, and assumption (i) of our hypothesis test does not hold.
When we compare the Rényi entropy of the LSTM to the lexicon's, we get much more nuanced results. While the English lexicon seems to hinder homophony (homophony is more surprising in the real lexicon than expected from its phonotactics), the opposite is true for Dutch. Meanwhile, German presents no clear trends. We should, however, refrain from making strong claims about these results. While the difference between the LSTM's train and test cross-entropies is small, implying that it overfits only to a small degree, its precise impact on the Rényi entropy is hard to quantify. Furthermore, expanding our analysis to CELEX's multi-morphemic words leads to somewhat different results (see App. B). Hence, we see no clear pattern across these languages and, thus, find no pressure either in favour of or against homophony.
We conclude this section with a warning. When exploring linguistics using language models, one should carefully consider these models' inherent inductive biases and their potential effects on results. While overfit n-grams provide seemingly strong evidence of homophony avoidance in natural lexicons, we arrive at different results using better models.

C Shannon vs. Collision Entropy
In this section, we exemplify the difference between the Rényi and Shannon entropies. With that in mind, we define a distribution over n + 1 instances, where probability mass is distributed such that:

p(x_0) = k,  \quad  p(x_i) = \frac{1 - k}{n}  \text{ for } i \in \{1, \ldots, n\}

This distribution, thus, puts k probability mass on x_0, and uniformly distributes the rest among the n other instances. Figure 2 shows the behaviour of both entropies with n fixed at 99 while we vary the mass p(x_0). In it, we see that the Rényi collision entropy is always smaller than or equal to the Shannon entropy, an important property of the Rényi collision entropy. Figure 1 presents these entropies while changing both n and k, i.e. p(x_0). In this figure, we see that when a large probability mass is already allocated to a single instance (or a few), the Rényi entropy becomes relatively insensitive to how the remaining mass is distributed among the other instances. The Shannon entropy, on the other hand, is still affected by these other instances' distribution, and goes to infinity as n → ∞.
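Both entropies of this distribution can be computed directly from their definitions (eqs. (3) and (4)); a quick numeric illustration of the behaviour described above:

```python
import math

def entropies(n, k):
    """Shannon and collision entropy (bits) of the appendix distribution:
    p(x_0) = k, p(x_i) = (1 - k) / n for the other n instances."""
    rest = (1 - k) / n
    shannon = -(k * math.log2(k) + n * rest * math.log2(rest))
    collision = -math.log2(k ** 2 + n * rest ** 2)
    return shannon, collision

# With most mass on x_0, the collision entropy barely moves as n grows,
# while the Shannon entropy keeps increasing with n.
for n in (99, 999, 9999):
    print(n, entropies(n, k=0.9))
```

With k = 0.9, the collision entropy stays pinned near -log2(0.81) ≈ 0.3 bits across all three values of n, while the Shannon entropy keeps growing, matching the trends in Figures 1 and 2.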
Relating this to our analysed n-gram models, we see that, by allocating a large probability mass to the training set, they can obtain a small Rényi entropy. However, since they smoothly distribute the rest of their probability mass throughout W, they achieve a high Shannon entropy.

D Proof of Theorem 3.1

Theorem 3.1. Let W_δ be the set of all wordforms with a probability of at least δ, i.e.

W_\delta = \{w \in W \mid p(w) \ge \delta\}

We can bound our estimation error as:

H_2(p) \le \hat{H}_2(p) \le H_2(p) + \log\left(1 + \frac{(1 - \xi)\,\delta}{\eta}\right)

where we can precisely compute both ξ and η, which are defined as \xi = \sum_{w \in W_\delta} p(w) and \eta = \sum_{w \in W_\delta} p(w)^2.

Proof. We first decompose the error in our estimate as

H_2(p) = -\log \sum_{w \in W} p(w)^2 \overset{(1)}{=} -\log \Big( \sum_{w \in W_\delta} p(w)^2 + \sum_{w \in W \setminus W_\delta} p(w)^2 \Big) = -\log \Big( \eta + \sum_{w \in W \setminus W_\delta} p(w)^2 \Big)

where \eta = \sum_{w \in W_\delta} p(w)^2 and equality (1) follows from the definition of H_2 and the separation of the sum into two parts. We define \xi = \sum_{w \in W_\delta} p(w) and, thus, 1 - \xi = \sum_{w \in W \setminus W_\delta} p(w). Now, by invoking Lemma D.1 with \beta = 1 - \xi, we have the inequality

\sum_{w \in W \setminus W_\delta} p(w)^2 \le (1 - \xi)\,\delta

and therefore

\hat{H}_2(p) - \log\left(1 + \frac{(1 - \xi)\,\delta}{\eta}\right) = -\log\big(\eta + (1 - \xi)\,\delta\big) \le H_2(p) \le -\log \eta = \hat{H}_2(p)

which proves the theorem.

Lemma D.1. Let \{x_n\}_{n=1}^{N} be non-negative values such that \sum_{n=1}^{N} x_n = \beta and x_n \le \delta for all n. Then

\sum_{n=1}^{N} x_n^2 \le \beta \cdot \delta

Proof. We claim the maximal solution, i.e. the set \{x_n\}_{n=1}^{N} which maximises \sum_n x_n^2 under these constraints, is x_k = \delta for k \in \{1, \ldots, K\} and x_{K+1} = \beta - K\delta for some K < N. We prove its maximality by contradiction. Suppose there exists another maximal solution, not of this form. Further, suppose that the values of this solution are sorted, such that x_i \ge x_j for any i < j. Then, there must exist two indices i and j such that \delta > x_i \ge x_j > 0 and i < j. Now, let \epsilon = \min(\delta - x_i, x_j) > 0. Transferring \epsilon mass from x_j to x_i strictly increases the objective, since

(x_i + \epsilon)^2 + (x_j - \epsilon)^2 = x_i^2 + x_j^2 + 2\epsilon(x_i - x_j) + 2\epsilon^2 > x_i^2 + x_j^2

contradicting the maximality of this solution. Finally, for the claimed maximal solution, since \beta - K\delta \le \delta,

\sum_{n=1}^{N} x_n^2 = K\delta^2 + (\beta - K\delta)^2 \le K\delta^2 + (\beta - K\delta)\,\delta = \beta\,\delta

which proves the lemma.