Modeling the Unigram Distribution

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach, being highly dependent on sample size, assigns zero probability to any out-of-vocabulary (oov) word form. As a result, it produces negatively biased probabilities for any oov word form and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level language models.


Introduction
Neural networks have yielded impressive gains in sentence-level language modeling across a typologically diverse set of languages (Mikolov et al., 2010; Kalchbrenner et al., 2016; Merity et al., 2018; Melis et al., 2018; Cotterell et al., 2018). Similarly, neural networks constitute the state of the art in modeling the distribution over a language's word types (Pimentel et al., 2020), outperforming non-neural generative models such as Futrell et al.'s (2017) with character-level models. This paper focuses on a less-researched task that is halfway between sentence-level language modeling and word type distributions: modeling the unigram distribution, the distribution over word tokens in a language, which reflects both the probability of a word's form and its frequency in the language. In particular, as opposed to sentence-level modeling, the unigram distribution does not consider contextual information.

The unigram distribution is a central object in the science of language, from historical linguistics to psycholinguistics and beyond (Baayen et al., 2016; Diessel, 2017; Divjak, 2019). However, the majority of research on unigram distributions identifies this distribution with sample frequency. This approach results in poor estimates, as it assigns zero probability to out-of-vocabulary words. 1 Further, it is highly dependent on sample size (Baayen, 2002).

The core contribution of our work is motivating the unigram distribution as a worthwhile objective for scientific inquiry, one which is currently understudied in the field. With that in mind, we also present a neuralization of Goldwater et al.'s (2011) two-stage model. 2 The gist of this approach is using two components to model the Zipfian distribution of word tokens in a language (Zipf, 1935) separately from its phono- or graphotactic distribution.
The first component, termed the adaptor, is based on the Pitman-Yor process (PYP; Pitman and Yor, 1997), and has the ability to model the power-law behavior of word tokens. The second, termed the generator, leverages a character-level neural language model to capture structural patterns in written words, e.g. graphotactics and morphology.

1 In the Turkish Wikipedia, for example, considering a training set of 8 million and a test set of 1 million tokens, 27.4% of test types and 5.3% of test tokens are out-of-vocabulary. Note that, according to Heaps' law, a similar behavior would be expected from corpora of any size (Herdan, 1960; Heaps, 1978).

2 While Goldwater et al. (2011) acknowledge that their model could be used in various tasks of learning linguistic structure, they only present results in modeling morphology.
Critically, naïvely training a character-level neural model on either types (i.e. unique word forms) or tokens (i.e. word forms in their original frequencies) should lead to degenerate results. Models trained on natural corpora (i.e. token data) should excel in modeling the most common words of a language, but may poorly approximate the set of infrequent word forms which individuals dynamically produce (e.g. through compositional morphology). On the other hand, training models on the collection of unique word forms (i.e. type data) would give equal weight to typical and atypical productions, potentially leading to poor performance on the most frequent forms, which any individual would recognize as part of their language. In the two-stage model, as we will show, our generator is trained on a dataset interpolated between types and tokens, better modeling the nuance between frequent and infrequent word forms.

By testing our model on a set of languages with diverse morphological and phonological characteristics, we find that it is capable of modeling both frequent and infrequent words, thus producing a better estimate of the unigram distribution than a character-level LSTM. The empirical superiority of our two-stage model is shown in Fig. 1, where the surprisal (i.e. the negative log-probability, measured in nats here) of each token is plotted under four different models for Finnish. Our proposed two-stage model achieves a lower or similar surprisal to the baselines on tokens of all frequencies, with similar patterns arising in all analyzed languages. 3

The Unigram Distribution
The unigram distribution is a probability distribution over the possible word forms in a language's lexicon. This probability takes the frequency of a token into account, assigning larger probabilities to word forms which are more likely to be encountered in a language's utterances, thus differing from word type distributions, such as in Pimentel et al. (2020). It is also not conditioned on a word's context, as it considers each word token as a stand-alone unit, as opposed to the task of language modeling, e.g. Mikolov et al. (2010).
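The distinction between these distributions can be made concrete with a toy corpus (a hypothetical sketch; the word list and counts are illustrative):

```python
from collections import Counter

# Toy corpus of word tokens (hypothetical example).
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Unigram distribution: probability proportional to token frequency.
counts = Counter(tokens)
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

# Type-level view: every unique form weighted equally, frequency ignored.
types = set(tokens)
type_dist = {w: 1 / len(types) for w in types}

print(unigram["the"])   # 0.375 (3 of 8 tokens): frequency matters
print(type_dist["the"]) # 0.2 (1 of 5 types): every form weighted equally
```

The uniform type distribution here is only a simplification to contrast the two views; type-level models in the text instead model the probability of a word's form.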

Complex Vocabularies
The composition of spoken vocabularies is structured according to a host of factors. Stemming from articulatory biases, each language has a set of constraints on which sequences of speech sounds can be valid words in it; this is termed the phonotactics of a language. Languages also exhibit small but non-negligible biases in the regular match of forms and meanings (Dingemanse et al., 2015; Pimentel et al., 2019, 2021b). Additionally, expectations about morphology can constrain the production or processing of a given word as belonging to a particular word class (as shown, for instance, in Jabberwocky- and wug-type tasks; Berko 1958; Hall Maudslay and Cotterell 2021).
While individuals often have strong intuitions about these patterns, their judgments are typically gradient rather than categorical (Hayes and Wilson, 2008; Gorman, 2013). The effective set of words that naturally occur in linguistic productions is known to be extremely diverse in its composition. Models deployed to explain and predict typical word forms in a given language might fail at capturing these corners of the space of possible forms. If the goal is to produce ecologically valid models that could approximate actual cognitive processes, these atypical forms should be efficiently learned in addition to the most typical productions.

Imbalanced Frequencies
Zipf (1935) popularized the observation that the frequency of a word in a corpus is inversely proportional to its rank, approximately following a power-law distribution. As such, a small subset of the most common word types dominates the corpus. These extremely frequent words tend to be short in length and exceptionally archaic, in the sense that they preserve traces of previous phonotactic and phonological profiles that might have ceased to be productive. This is particularly relevant when we consider scenarios where substantial portions of the vocabulary might have been borrowed from different sources over time. English is a textbook example: Williams (1986) reports that French, Latin, Germanic and Greek account for 29%, 29%, 26% and 6% of all words' origins in the vocabulary (plus a remaining 10% of diverse origin). The most frequent portion of the vocabulary best preserves the original West Germanic forms, consisting largely of articles, prepositions, pronouns, and auxiliaries. Further, irregular inflections tend to be more common in these highly frequent words (Ackerman and Malouf, 2013; Cotterell et al., 2019). This observation might invite one to omit frequency information from training data, i.e. to use types, in order to balance out the role of the most frequent words.
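A quick simulation illustrates the imbalance Zipf (1935) described (a sketch with a synthetic corpus; the constants are arbitrary):

```python
from collections import Counter

# Hypothetical corpus built so that frequency ∝ 1/rank (Zipf's law):
# the word ranked r appears roughly C/r times.
C = 1200
vocab = [f"w{r}" for r in range(1, 101)]
corpus = [w for r, w in enumerate(vocab, start=1) for _ in range(C // r)]

freqs = sorted(Counter(corpus).values(), reverse=True)
# Under a pure power law, rank * frequency is roughly constant.
products = [r * f for r, f in enumerate(freqs, start=1)]
print(products[:5])  # [1200, 1200, 1200, 1200, 1200]

# A small head of the vocabulary covers most of the tokens.
head_mass = sum(freqs[:10]) / len(corpus)
print(round(head_mass, 2))
```

Here the 10 most frequent types (10% of the vocabulary) account for over half of all tokens, mirroring the dominance of the head described above.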
On the other side of the frequency scale, however, any natural language data would have plenty of low-frequency words that reflect the open boundaries of the vocabulary. These might include nonce words (blick), expressive transformations of other words (a loooooooooooong summer), specialized terms (onabotulinumtoxina), and names, among others. In addition, genuine orthographic misproductions (langague) will be present to some degree.
Finally, acronyms (HTML) will be present in all frequency bands. These should be particularly problematic to model, since they do not necessarily follow the language's graphotactics to any degree. There are also frequent and infrequent loanwords with different degrees of adjustment to the grapho- and phonotactics of the rest of the vocabulary. For instance, it has been estimated that 96% and 21% of English speakers know the Afrikaans-originated words aardvark and aardwolf, respectively (Brysbaert et al., 2019). 4 These are the only written word forms in English with a non-negligible frequency that display two letter 'a's in word-initial position.
This whimsical nature of the vocabulary of a language makes modeling the unigram distribution challenging: Naïvely training a model to capture word forms at either the token or type level is likely to give disproportionate emphasis to phonotactically unrepresentative words. However, this is also why its modeling is a worthwhile task-it captures both frequent and rare productions, combining form probability with frequency information.

The Two-stage Model

The first component, termed the generator, is a model used to produce a set of i.i.d. word forms {ℓ_k}_{k=1}^{K}. The second component is termed the adaptor, and it assigns each instance in the training set to a cluster {z_n}_{n=1}^{N}. Under this model, each token in a dataset has a corresponding cluster z_n which defines the token's word form w_n = ℓ_{z_n}. We note that both word forms ℓ and clusters z are latent variables, and only tokens w are observed during training.
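The generative story can be sketched in a few lines (the latent values below are hypothetical):

```python
# Generative story sketch: the generator proposes word forms, the adaptor
# assigns each token to a cluster; a token's form is its cluster's form.
forms = ["dog", "cat", "dog"]         # ℓ: i.i.d. draws from the generator
clusters = [0, 1, 0, 2, 1, 0]         # z: one cluster id per token (adaptor)
tokens = [forms[z] for z in clusters] # w_n = ℓ_{z_n}, the observed data
print(tokens)  # ['dog', 'cat', 'dog', 'dog', 'cat', 'dog']
```

Note that the same form ("dog") may be drawn twice by the generator, so one word form can own several clusters.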
Generator. The generator is a model which produces word forms; we use a character-level LSTM here (Hochreiter and Schmidhuber, 1997), as in: ℓ_k ∼ p_φ(·), where p_φ is the character-level model's distribution over strings. These word forms ℓ_k are sampled i.i.d.; thus, the same word form may be sampled more than once.
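As a stand-in for the LSTM, a character-bigram model with add-alpha smoothing shows what the generator provides: a probability for any string, higher for graphotactically typical forms. The tiny training set, boundary markers, and smoothing constants here are all hypothetical:

```python
import math
from collections import Counter, defaultdict

BOS, EOS = "^", "$"  # hypothetical word-boundary markers

def train_generator(forms):
    """Fit character-bigram counts on a set of word forms."""
    counts = defaultdict(Counter)
    for w in forms:
        chars = [BOS] + list(w) + [EOS]
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts

def log_prob(counts, w, alpha=1.0, vocab_size=30):
    """log p(w) as a sum of log p(c_i | c_{i-1}) terms, with add-alpha
    smoothing so unseen transitions keep a small nonzero probability."""
    lp = 0.0
    chars = [BOS] + list(w) + [EOS]
    for prev, cur in zip(chars, chars[1:]):
        total = sum(counts[prev].values())
        lp += math.log((counts[prev][cur] + alpha) / (total + alpha * vocab_size))
    return lp

gen = train_generator(["cat", "cats", "car", "bat"])
# a graphotactically plausible novel form beats an implausible one
print(log_prob(gen, "bats") > log_prob(gen, "xqzt"))  # True
```

An LSTM plays the same role with a much richer conditioning context than a single previous character.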
Adaptor. Each word form sampled from the generator corresponds to a cluster. The adaptor then assigns a frequency to each of the clusters according to a Pitman-Yor process:

p(z_n = k | z_{<n}) = (c_{<n}^{(k)} − a) / (n − 1 + b)    if 1 ≤ k ≤ K_{<n}
p(z_n = k | z_{<n}) = (a · K_{<n} + b) / (n − 1 + b)      if k = K_{<n} + 1

where 0 ≤ a < 1 and 0 ≤ b are hyperparameters of the PYP, z_{<n} are the previous cluster assignments, K_{<n} is the current number of clusters with at least one token, and c_{<n}^{(z_n)} is the number of tokens previously assigned to cluster z_n. This adaptor, as a Pitman-Yor process, allows us to model the power-law distribution of word tokens.
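Summing the adaptor's probabilities over all existing clusters that share a given word form yields a token's marginal probability: a damped count term plus a new-cluster term weighted by the generator. A minimal sketch of this standard PYP predictive rule (the counts and generator probabilities below are hypothetical, and N denotes the number of previously seen tokens):

```python
def two_stage_prob(w, c, n_clusters, N, a, b, gen_prob):
    """p(w) = (c_w - a*n_w)/(N + b) + (a*K + b)/(N + b) * p_gen(w)

    c: token counts per form (c_w); n_clusters: clusters per form (n_w);
    N: tokens seen so far; K: total clusters; gen_prob: generator p(w).
    """
    K = sum(n_clusters.values())
    cached = max(c.get(w, 0) - a * n_clusters.get(w, 0), 0.0) / (N + b)
    novel = (a * K + b) / (N + b) * gen_prob
    return cached + novel

# hypothetical statistics: "the" seen 50 times in one cluster, etc.
c = {"the": 50, "cat": 5}
n_clusters = {"the": 1, "cat": 1}
p_the = two_stage_prob("the", c, n_clusters, N=55, a=0.5, b=1.0, gen_prob=1e-3)
p_oov = two_stage_prob("blick", c, n_clusters, N=55, a=0.5, b=1.0, gen_prob=1e-4)
print(p_the > p_oov > 0)  # True: oov mass comes only from the generator
```

Frequent forms are dominated by the damped count term, while unseen forms fall back entirely on the generator, which is what allows the model to assign nonzero probability to out-of-vocabulary words.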
Two-stage Model. Given a cluster assignment and the list of word forms, defining a token's form is deterministic: p(w_n | z_n, ℓ) = 1{w_n = ℓ_{z_n}}. Thus, our model factorizes a new token's probability into two terms:

p(w_n = w | z_{<n}, ℓ) = (c_w − a · n_w) / (n − 1 + b) + ((a · K_{<n} + b) / (n − 1 + b)) · p_φ(w)

where c_w is the number of occurrences of word form w in our training corpus and n_w is the number of distinct clusters to which it has been assigned. In practice, the two-stage model acts as an interpolation between a smoothed 1-gram model, i.e. corpus frequencies, and an LSTM character model. Notably, this model learns per-word smoothing factors and its interpolation weight in an unsupervised manner through the inference of the PYP's parameters. The adaptor is fit using Gibbs sampling, and the generator is trained using a cross-entropy loss on the set of non-empty clusters produced by the adaptor. The generator is thus trained on a more balanced corpus in which the proportion of the most frequent words is reduced; this can be seen as an interpolation between a type and a token dataset. 7

Experiments
Dataset. We use Wikipedia data and evaluate our model on the following languages: English, Finnish, Hebrew, Indonesian, Tamil, Turkish and Yoruba. These languages represent a typologically diverse set, with different levels of morphological complexity, ranging from rich (e.g. Finnish) to poor (e.g. Yoruba), as well as distinct scripts and graphotactic patterns. In preprocessing, we first split the data into sentences and then into tokens using spaCy (Honnibal et al., 2020). We then sample 10^6 tokens as our training set for each language (except for Yoruba, for which we had less data; see App. F for more details). From these, we build two distinct datasets: a token dataset, which corresponds to the list of word forms with their corpus frequencies, and a type dataset, containing the set of unique word forms in the data.
Evaluation. We measure the cross-entropy of our models on a held-out test set; this is the standard evaluation for language modeling. We approximate this cross-entropy using a sample mean estimate

H(p, q) ≈ −(1/N) Σ_{n=1}^{N} log q(w_n)

where q is the evaluated model and we assume instances w_n are sampled from the true unigram distribution p(w). Specifically, these token samples {w_n}_{n=1}^{N} take the form of the token dataset. The model with the lowest cross-entropy is the one that diverges the least from the true distribution.
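The estimator is a few lines of code (a sketch; the model probabilities here are made up). Note that a model assigning zero probability to any test token, as sample frequency does for oov words, makes this quantity infinite:

```python
import math

def cross_entropy(model_prob, test_tokens):
    """Sample-mean estimate: H ≈ -(1/N) * Σ log q(w_n), in nats."""
    return -sum(math.log(model_prob(w)) for w in test_tokens) / len(test_tokens)

# hypothetical model q and a token-level test sample
q = {"the": 0.5, "cat": 0.25, "sat": 0.25}
tokens = ["the", "the", "cat", "sat"]
h = cross_entropy(lambda w: q[w], tokens)
print(round(h, 4))  # 1.0397
```

Using `math.log` gives the estimate in nats, matching the surprisal units used in Fig. 1.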
Baseline Models. As neural networks yield state-of-the-art performance in language modeling tasks, we expect them to also do well with the unigram distribution. In fact, pseudo-text generated by LSTM-based language models reproduces Zipf's law to some extent (Takahashi and Tanaka-Ishii, 2017; Meister and Cotterell, 2021). Thus, we view state-of-the-art LSTM models as a strong baseline. We train a character-level LSTM language model (Pimentel et al., 2020) to directly approximate the unigram distribution by training it on the token dataset, modeling these tokens at the character level. As a second baseline, we train an LSTM on the type dataset. However, we expect this model to be outperformed by the token one in the unigram distribution task, as the information on word frequency is not available during its training. We do not use a word-level 1-gram model (i.e. the words' sample frequency) as a baseline here, since it results in an infinite cross-entropy for any test set containing oov words. We empirically compare four models: two-stage, generator, token, and type.
Modeling Tokens. Cross-entropy on the token test sets can be found in Tab. 1. These results show our two-stage model indeed creates a more accurate estimate of the unigram distribution, producing the smallest cross-entropy across all languages.
Frequent vs Infrequent Words. The weaknesses of the token and type models are evinced by Fig. 1. In line with our hypothesis, the token model achieves lower cross-entropy on the most common words, but fails to model the rare ones accurately.  The cross-entropy achieved by the type model does not change as much with word frequency, but is higher than the one achieved by the token model for most of the vocabulary. We also see that the two-stage model performs well across all word frequencies. Indeed, this model appears to behave similarly to the token model with frequent words, but obtains a lower cross-entropy on the rare ones, where the role of the generator in the estimated probability is emphasized. We suspect this is the reason behind the two-stage model's success.
The Long Tail. Fig. 1 also demonstrates that the entropy estimate for the rare words grows quickly and exhibits a large variance across models. This reflects the heterogeneous nature of the words that only appear a few times in a corpus. This part of the vocabulary is where the type model achieves the best results for all languages except Yoruba (see Tab. 2). 8 The fact that singletons (also known as hapax legomena), i.e. word forms which occur only once in the test set, form a large portion of the type dataset boosts the type model's performance on rare words. However, in the case of words appearing more than once (see Tab. 3), the two-stage model achieves the best results across languages. Furthermore, on these non-singleton words, the generator outperforms the type and token models in all languages except for Yoruba. This shows the utility of training the generator on an interpolation between types and tokens. In addition, we note that one may justifiably question whether properly modeling singletons is a desirable feature, since they are likely to contain unrepresentative word forms, such as typos, as discussed previously. Indeed, it appears that the two-stage model not only leads to tighter estimates of the unigram distribution, but also allows us to train a better graphotactics model, capable of modeling both frequent word forms as well as new productions.

8 We note that we used considerably less training data for Yoruba than for other languages.
Future Work. The results we present focus on analyzing the two-stage model. The generator, though, produces interesting results by itself, modeling non-singleton word forms better than the type and token models in most languages. This suggests that it might be better at modeling the graphotactics of a language than either of these baselines. Future work should explore if this indeed is the case.

Conclusion
In this work, we motivate modeling the unigram distribution as an important task for both the psycholinguistics and natural language processing communities, one that has received too little attention. We present a two-stage model for estimating this distribution, a neuralization of Goldwater et al.'s (2011) model, which is motivated by the complex makeup of vocabularies: it defines the probability of a token by combining the probability of its appearance in the training corpus with the probability of its form. We have shown, through a cross-entropy evaluation, that our model outperforms naïve solutions and is capable of accurately modeling both frequent and infrequent words.

B Hyperparameter Search
The same hyperparameters are used for both our baseline LSTMs and the generator. We use 3 layers, where the embedding size is 128, the hidden size is 512, and the dropout probability is 0.33. Training the two-stage model takes a considerable amount of time (see Tab. 5), so we cannot do exhaustive hyperparameter tuning. Random search (Bergstra and Bengio, 2012) is used in tuning the values of a and b, where we run five training procedures considering the ranges a ∈ [0, 1) and b ∈ [100, 200,000). We tune the hyperparameters for each language by minimizing the model's cross-entropy on the development set, training on a subset of the training data with only 100,000 tokens. The optimal values found for a and b are rounded to two decimal places and to the nearest thousand, respectively. Our two-stage model is trained for five iterations of expectation-maximization.
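The search loop itself is short (a sketch; `dev_loss` is a hypothetical stand-in for training on the 100,000-token subset and measuring development-set cross-entropy, and its form is invented):

```python
import random

def dev_loss(a, b):
    # toy surrogate for "train the model with (a, b), return dev cross-entropy"
    return (a - 0.3) ** 2 + (b / 200_000 - 0.1) ** 2

# Random search (Bergstra and Bengio, 2012) over the PYP hyperparameters,
# with the ranges from the text: a in [0, 1), b in [100, 200,000).
rng = random.Random(42)
trials = [(rng.uniform(0.0, 1.0), rng.uniform(100, 200_000)) for _ in range(5)]
best_a, best_b = min(trials, key=lambda ab: dev_loss(*ab))

# round as in the text: a to two decimals, b to the nearest thousand
best_a, best_b = round(best_a, 2), round(best_b / 1000) * 1000
print(best_a, best_b)
```

With only five trials, random search simply keeps the configuration with the lowest development loss.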

E Inference
Unfortunately, there is no closed-form solution for inferring the parameters of our two-stage model. In order to obtain a sample of cluster assignments and train the generator to match their labels, we estimate the parameters of the generator and the adaptor concurrently, freezing one's parameters while training the other. We use a regime corresponding to the Monte Carlo Expectation-Maximization (EM) algorithm to train the model (Wei and Tanner, 1990), which can be found in Algorithm 1. In the E-step, the function GIBBSSAMPLER returns the cluster assignments z and the dampened word dataset obtained via Gibbs sampling from the PYP. We then use this dampened dataset to train the generator in the M-step.

E.1 Gibbs Sampler For Cluster Assignments
The Pitman-Yor process does not have a well-defined posterior probability. Nonetheless, we can use Gibbs sampling to obtain a sample from the posterior distribution over cluster assignments defined by the two-stage model. 9 We model our sampler after the morphological sampler presented by Goldwater et al. (2011).
Gibbs sampling is a Markov chain Monte Carlo (MCMC) method which approximates the posterior of a multivariate distribution. It iteratively samples from the conditional distribution of each variable, given the values of the other dimensions (Neal, 1993). We use the conditional distribution defined in eq. (7) (presented in Fig. 2) in the Gibbs sampler, where we know the word form w_n of token n (since it is observable in the corpus) and where the values of all other cluster assignments are fixed. Note that, according to eq. (7), we only assign word tokens to clusters with the same form or create a new cluster; when a new cluster is created, its word form is set to w_n. As such, each cluster contains a single shared word form. For each adaptor training iteration, we run the Gibbs sampler for six epochs and choose the cluster assignments that perform best on a development set. Furthermore, we persist the adaptor state across iterations, warm-starting the Gibbs sampler with the cluster assignments of the previous iteration.
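A sketch of one such sweep over the tokens (the data structures and the constant generator probability are hypothetical; the real implementation uses the optimizations of App. E.3):

```python
import random

def gibbs_sweep(tokens, z, counts, forms, a, b, gen_prob, rng):
    """One Gibbs sweep: resample each token's cluster given all others.

    z[n]: token n's cluster id; counts[k]: cluster k's size;
    forms[k]: cluster k's word form. A token may only join an existing
    cluster sharing its form (weight c - a), or open a new cluster
    (weight (a*K + b) * p_gen(w)).
    """
    for n, w in enumerate(tokens):
        counts[z[n]] -= 1  # remove token n from its current cluster
        K = sum(1 for c in counts.values() if c > 0)
        cands = [k for k, c in counts.items() if c > 0 and forms[k] == w]
        weights = [counts[k] - a for k in cands]
        cands.append(max(counts) + 1)  # fresh id for a new cluster
        weights.append((a * K + b) * gen_prob(w))
        k_new = rng.choices(cands, weights=weights)[0]
        counts[k_new] = counts.get(k_new, 0) + 1
        forms[k_new] = w
        z[n] = k_new
    return z, counts, forms

rng = random.Random(1)
tokens = ["the", "the", "the", "cat"]
z = {0: 0, 1: 1, 2: 2, 3: 3}          # start: one cluster per token
counts = {0: 1, 1: 1, 2: 1, 3: 1}
forms = {0: "the", 1: "the", 2: "the", 3: "cat"}
z, counts, forms = gibbs_sweep(tokens, z, counts, forms, a=0.1, b=0.5,
                               gen_prob=lambda w: 0.01, rng=rng)
print(z)
```

Restricting candidate clusters to those sharing the token's form enforces the indicator in the conditional, so every cluster keeps a single shared word form.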

E.2 Training the Generator
In order to train the generator on word form data with a more balanced frequency distribution, a new training set is dynamically created. In this dataset, each word form appears as many times as it has been assigned as a cluster label in Algorithm 1. 10

10 A regime similar to using the inverse-power transformed counts of the tokens in the corpus (Goldwater et al., 2011).
This new training set allows us to train the generator on an interpolation between a purely type- or token-based dataset; this interpolation can be controlled through the parameters a and b. Setting the values of a and b to zero will cause the model to favor existing clusters over creating new ones, resulting in every token with the same form being assigned to a single cluster. In this case, the generator parameters would be estimated using the equivalent of a type corpus. Similarly, when a approaches one, or in the limit of b → ∞, fewer tokens will be assigned per cluster and the number of single-token clusters grows. This is effectively equivalent to training the generator using tokens. Consequently, non-extreme values of a and b are a middle ground.

p(z_n | z_{<n}, w_n) ∝ p(z_n, w_n | z_{<n})
  ∝ (c_{<n}^{(z_n)} − a) · 1{w_n = ℓ_{z_n}}    if 1 ≤ z_n ≤ K_{<n}
  ∝ (a · K_{<n} + b) · p_φ(w_n)                if z_n = K_{<n} + 1        (7)

Figure 2: The probability of assigning token w_n to cluster z_n in the two-stage model, given all other cluster assignments z_{<n}.

We train the character-level LSTM used as our generator with stochastic gradient descent using a cross-entropy loss function. This model is trained with early stopping; it is evaluated every 200 batches, and training stops when the development set loss has increased for 5 consecutive epochs.

E.3 Training Optimizations
The naïve implementation of the Gibbs sampler for cluster assignments quickly becomes computationally expensive in practice. Consequently, our implementation uses the optimized algorithm designed by Blunsom et al. (2009) for the hierarchical Dirichlet process, extending it to Pitman-Yor processes with the additional parameter a.

F Dataset
As noted in the main text, we use Wikipedia data in our experiments. After shuffling, the number of sentences used in our experiments is capped at one billion. Additionally, we define an upper bound on the number of tokens used in each experiment. If the training data exceeds this limit, we construct a corpus by re-sampling (with replacement) the desired number of tokens using the corpus frequencies calculated from the original training corpus. The numbers of types and tokens used in training and evaluation are presented in Tab. 7. Noise in the Wikipedia data is somewhat reduced by hand-defining an alphabet for each language and removing any sentence which includes words with invalid graphemes. 11

11 We define the alphabets using the languages' Wikipedia articles and the following website: https://r12a.github.io/app-charuse/.
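The re-sampling step can be sketched with the standard library (the corpus below is hypothetical; `limit` plays the role of the token cap):

```python
import random
from collections import Counter

def resample_corpus(tokens, limit, rng):
    """Re-sample `limit` tokens (with replacement) from corpus frequencies."""
    counts = Counter(tokens)
    forms = list(counts)
    weights = [counts[w] for w in forms]
    return rng.choices(forms, weights=weights, k=limit)

rng = random.Random(0)
corpus = ["a"] * 900 + ["b"] * 90 + ["c"] * 10
sample = resample_corpus(corpus, limit=500, rng=rng)
print(len(sample), Counter(sample)["a"] > Counter(sample)["b"])
```

Sampling with replacement preserves the original corpus frequencies in expectation while enforcing the token budget.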