Learning the hyperparameters to learn morphology

We perform hyperparameter inference within a model of morphology learning (Goldwater et al., 2011) and find that it affects model behaviour drastically. Changing the model structure successfully avoids the unsegmented solution, but results in oversegmentation instead.


Introduction
Bayesian models provide a sound statistical framework in which to explore aspects of language acquisition. Explicitly specifying the causal and computational structure of a model enables the investigation of hypotheses such as the feasibility of learning linguistic structure from the available input (Perfors et al., 2011), or the interaction of different linguistic levels (Johnson, 2008a). However, these models can be sensitive to small changes in (hyper-)parameter settings. Robustness in this respect is important, since positing specific parameter values is cognitively implausible.
In this paper we revisit a model of morphology learning presented by Goldwater and colleagues (Goldwater et al., 2006, 2011; henceforth GGJ). This model demonstrated the effectiveness of non-parametric stochastic processes, specifically the Pitman-Yor Process, for interpolating between types and tokens. Language learners are exposed to tokens, but many aspects of linguistic structure are lexical; identifying which tokens belong to the same lexical type is crucial. Surface form is not always sufficient, as in the case of ambiguous words. Moreover, morphology in particular is influenced by vocabulary-level type statistics (Bybee, 1995), so it is important for a model to operate on both levels: token statistics from realistic (child-directed) input, and type-level statistics based on the token analyses.
The GGJ model learns successfully given fixed hyperparameter values in the Pitman-Yor Process. However, we show that when these hyperparameters are inferred, the model collapses to a token-based model with a trivial morphology. In this paper we discuss the reasons for this problematic behaviour, which are relevant for other models based on Pitman-Yor Processes with discrete base distributions, common in natural language tasks. We investigate some potential solutions by changing the way morphemes are generated within the model. Our results are mixed: we avoid the hyperparameter problem, but learn overly compact morpheme lexicons.

The Pitman-Yor Process
The Pitman-Yor Process G ∼ PYP(a, b, H_0) (Pitman and Yor, 1997; Teh, 2006) generates distributions over the space of the base distribution H_0, with the hyperparameters a and b governing the extent of the shift from H_0. Draws from G have values from H_0, but with probabilities given by the PYP. For example, in a unigram PYP language model with observed words, H_0 may be a uniform distribution over a vocabulary of T types, U(1/T). The PYP shifts this distribution to the power-law distribution over tokens found in natural language, allowing words to have much higher (and lower) than uniform probability. We will continue using the language model example in this section, since the subsequent morphology model is effectively a complex unigram language model in which word types correspond to morphological analyses. In our presentation, we pay particular attention to the role of the hyperparameter a, since this value governs the power-law behaviour of the PYP (Buntine and Hutter, 2010).
When G is marginalised out, the result is the PYP Chinese Restaurant Process, which is a useful representation of the distribution of observations (word tokens) to values from H_0 (types). In this restaurant, customers (tokens) arrive and are seated at one of a potentially infinite number of tables. Each table receives a dish (type) from the base distribution when the first customer is seated there; all subsequent customers at that table share the same dish. The probability of customer z_i being seated at a table k depends on the number of customers already seated at that table, n_k. Popular tables attract more customers, generating a Zipfian distribution of customers over tables. This Zipfian/power-law behaviour can be similar to that of natural language data, and is the principal motivation for using the PYP. However, it holds only for the distribution of customers over tables. When the base distribution is discrete, as in our language model example and the morphology model, the same dish may be served at multiple tables. The distribution of interest is generally that of customers (tokens) to dishes (types), rather than to tables, suggesting a preference for a setting in which each dish appears at few tables. This depends on a (constrained to 0 ≤ a < 1), and to a lesser extent on b: if a is small, each dish will be served at a single table, so the type-token and table-customer power laws match. If a is near 1, however, the probability of more than a single customer being seated at a table is small, and the distribution of dishes eaten by the customers will match the base distribution, rather than being adapted by the caching mechanism of the PYP.
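As an illustrative sketch (not the authors' implementation), the seating process with a uniform discrete base distribution can be simulated directly; the toy token list and `vocab_size` below are assumptions for the example:

```python
import random

def pyp_crp_seating(tokens, a, b, vocab_size, seed=0):
    """Seat observed tokens in a PYP(a, b) Chinese Restaurant whose
    base distribution is uniform over the vocabulary, H0 = 1/vocab_size.
    An occupied table k serving the token's dish has weight (n_k - a);
    a new table has weight (K*a + b) * H0(token)."""
    rng = random.Random(seed)
    sizes, dishes = [], []  # customers per table, dish served at each table
    for tok in tokens:
        # Only tables already serving this dish can be reused.
        weights = [(n - a) if dish == tok else 0.0
                   for n, dish in zip(sizes, dishes)]
        new_table_w = (len(sizes) * a + b) / vocab_size
        r = rng.random() * (sum(weights) + new_table_w)
        for k, w in enumerate(weights):
            r -= w
            if r < 0:
                sizes[k] += 1        # join an existing table
                break
        else:
            sizes.append(1)          # open a new table for this dish
            dishes.append(tok)
    return sizes, dishes

tokens = ["the"] * 50 + ["cat"] * 5 + ["dog"] * 5
few_tables, _ = pyp_crp_seating(tokens, a=0.1, b=0.1, vocab_size=3)
many_tables, _ = pyp_crp_seating(tokens, a=0.99, b=0.1, vocab_size=3)
```

With small a the three types occupy only a handful of tables; with a near 1, almost every token opens its own table, so the dish distribution reverts to the base distribution.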
The expected number of tables K grows as O(N^a) (see Buntine and Hutter (2010) for an exact formulation). The number of word types in the data gives us a minimum number of tables, K ≥ T. When a is small (less than 0.5), the expected number of tables is significantly less than the number of types in any non-trivial dataset, suggesting a lower bound on the values of a.
In our language model, the posterior probability of assigning a word w_i to a table k with dish ℓ_k and n_k previous customers is:

P(z_i = k) ∝ (n_k − a) · I(w_i = ℓ_k)    for an occupied table k
P(z_i = k_new) ∝ (Ka + b) · H_0(w_i)    for a new table

where I(w_i = ℓ_k) returns 1 if the token and the dish match, and 0 otherwise. We see that in order to prefer assigning customers to already occupied tables, we need H_0(w)(Ka + b) < n_k − a. Given K ≥ T, and setting H_0 = 1/T, we can approximate this with (1/T)(Ta + b) < n_k − a. From this we obtain a < ½(n_k − b/T), which indicates that in order for tables with a single customer (n_k = 1) to attract further customers, a must be smaller than 0.5. Thus, there is a tension between the number of tables required by the data and our desire to reuse tables. One solution is to fix a to an arbitrary, sufficiently small value, as GGJ do in their experiments. In contrast, in this paper we infer a and b along with the other parameters, and change the other free variable, the base distribution H_0.
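The reuse condition can be checked numerically; a minimal sketch, with assumed values for T and b:

```python
def prefers_reuse(a, b, T, n_k, K=None):
    """True if joining an occupied table with n_k customers outweighs
    opening a new one, under a uniform base distribution H0 = 1/T.
    K defaults to its minimum, the number of types T."""
    K = T if K is None else K
    return (n_k - a) > (1.0 / T) * (K * a + b)

# With T = 1000 types and b = 0.1, a singleton table (n_k = 1)
# attracts further customers only while a stays below roughly 0.5:
assert prefers_reuse(a=0.4, b=0.1, T=1000, n_k=1)
assert not prefers_reuse(a=0.6, b=0.1, T=1000, n_k=1)
```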

Morphology
The morphology model introduced by GGJ has a base distribution that generates not simply word types, as in the language model example, but morphological analyses. These are relatively simple, consisting of a stem+suffix segmentation and a cluster membership. The probability of a word is the sum over all cluster c, stem s, suffix f tuples whose concatenation matches the word:

P(w) = Σ_{c, s, f : s+f = w} P(c) P(s | c) P(f | c)

with the stems and the suffixes being generated from cluster-specific distributions. In the GGJ model, all three distributions (cluster, stem, suffix) are finite conjugate symmetric Dirichlet-Multinomial (DirMult) distributions. We retain the DirMult over clusters, but change the morpheme-generating distributions.
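As a toy sketch of this sum (with made-up cluster distributions standing in for the model's learned DirMult/DP components; the null stem is excluded here for simplicity):

```python
def word_prob(word, clusters):
    """P(word) summed over all (cluster, stem, suffix) analyses whose
    concatenation yields the word.  `clusters` maps a cluster id to a
    tuple (P(c), P(stem | c) dict, P(suffix | c) dict)."""
    total = 0.0
    for c, (p_c, p_stem, p_suf) in clusters.items():
        # Split point i: stem = word[:i], suffix = word[i:] ('' = null suffix).
        for i in range(1, len(word) + 1):
            stem, suffix = word[:i], word[i:]
            total += p_c * p_stem.get(stem, 0.0) * p_suf.get(suffix, 0.0)
    return total

# One hypothetical cluster: P(c)=1, toy stem and suffix probabilities.
clusters = {0: (1.0, {"walk": 0.5, "walked": 0.1}, {"ed": 0.3, "": 0.2})}
# "walked" sums two analyses: walk+ed (0.5*0.3) and walked+null (0.1*0.2).
p_walked = word_prob("walked", clusters)
```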
The DirMult is equivalent to a Dirichlet Process (DP) prior with a finite base distribution; we use this representation because it allows us to replace the base distributions flexibly. A DP(α, H_0) is also equivalent to a PYP with a = 0, and thus can also be represented with a Chinese Restaurant Process, but in this case we sum over all tables serving the same dish to obtain the predictive probability of (say) a stem s:

P(s) = (m_s + α H_S(s)) / (M + α)

Note that the counts m_s refer to stems generated within the base distribution, not to token counts within the PYP; M is the total number of such stems. The original GGJ model, ORIG, is equivalent to setting H_S for stems to U(1/S), and likewise H_F for suffixes to U(1/F), where S and F are the number of possible stems and suffixes in the dataset (i.e., all possible prefix and suffix strings, including a null string).
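A minimal sketch of this predictive probability, with toy counts and an assumed uniform base over 1000 possible stems:

```python
def dp_predictive(item, counts, alpha, h0):
    """Predictive probability of `item` under a DP(alpha, H0), with G
    integrated out: summing over all tables serving the same dish
    gives (m_item + alpha * H0(item)) / (M + alpha)."""
    M = sum(counts.values())
    return (counts.get(item, 0) + alpha * h0(item)) / (M + alpha)

uniform_h0 = lambda s: 1.0 / 1000          # U(1/S) with S = 1000 (toy value)
stem_counts = {"walk": 3, "jump": 1}       # m_s: toy base-distribution counts
p_seen = dp_predictive("walk", stem_counts, 1.0, uniform_h0)
p_new = dp_predictive("gorp", stem_counts, 1.0, uniform_h0)
```

Seen stems get most of their mass from the counts m_s; unseen stems fall back on α·H_S(s), which is how the choice of base distribution controls the cost of new morphemes.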
There are two difficulties with this model. Firstly, it assumes a closed vocabulary and requires setting S and F in advance, by looking at the data. As a cognitive model, this is awkward, since it assumes a fixed, relatively small number of possible morphemes.
Secondly, when the PYP hyperparameters are inferred, a is set to nearly 1, resulting in a model with as many tables as tokens. This behaviour is due to the interaction between vocabulary size and base distribution probabilities outlined in the previous section: this base distribution assigns relatively high probability to words, so new tables have high probability; as the number of tables increases (from its fairly large minimum), the optimal a for this table configuration also increases, resulting in convergence at the token-based model.
We investigate two alternate base distributions over stems and suffixes, both of which extend the space of possible morphemes, thereby lowering the overall probability of the observed words.
DP-CHAR generates morphemes by first generating a length l ∼ Poisson(λ). Characters are then drawn from a uniform distribution, c_{0...l} ∼ U(1/|Chars|). A morpheme's probability decreases exponentially with length, resulting in a strong preference for shorter morphemes.
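A sketch of the DP-CHAR base probability (assuming a 26-character alphabet; the actual character inventory of the corpora may differ):

```python
import math

def dp_char_base(morpheme, lam, n_chars=26):
    """DP-CHAR base probability: Poisson-distributed length, then
    uniformly drawn i.i.d. characters.  The (1/n_chars)**l factor makes
    probability decay exponentially in length, favouring short morphemes."""
    l = len(morpheme)
    p_length = math.exp(-lam) * lam ** l / math.factorial(l)
    return p_length * (1.0 / n_chars) ** l

# With the paper's settings (lam = 6 for stems, 0.5 for suffixes),
# shorter strings are strongly preferred:
p_short_stem = dp_char_base("walk", lam=6)
p_long_stem = dp_char_base("walking", lam=6)
```

Note that Poisson(λ) assigns mass to l = 0, so the null suffix is generated naturally under this base distribution.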
DP-UNI simply extends the original uniform distribution to s, f ∼ U(1/10^6), in effect moving probability mass to a large number of unseen morphemes. It is thus similar to DP-CHAR without the length preference.

Inference
We follow the same inference procedure as GGJ, using Gibbs sampling. The sampler iterates between inferring each token's table assignment and resampling the table labels (see GGJ for details). Within the morphology base distribution, the prior for the DirMult over clusters is set to α_k = 0.5. To replicate the original DirMult model, we set α_s = 0.001S and α_f = 0.001F. In the other models, α_s = α_f = 1. Within DP-CHAR, λ = 6 for stems and 0.5 for suffixes.

Table 1: Final values for a on the orthographic English and Spanish datasets, as well as the average number of tables for each word type. The 95% confidence interval across three runs is ≤ 0.01.
(Phonological Eve is similar to Orthographic Eve.)

Sampling Hyperparameters
We sample the PYP a and b hyperparameters using a slice sampler. Previous work with this model has always fixed these values, generally finding small a to be optimal and b to have little effect.
In experiments with fixed hyperparameters, we set a = b = 0.1.
To sample the hyperparameters, we place vague priors over them: a ∼ Beta(1, 1) and b ∼ Gamma(10, 0.1). The slice sampler samples a new value for a and b after every 10 iterations of Gibbs sampling.
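The paper does not specify the slice sampler's internals; as a hedged sketch, a generic univariate slice sampler (Neal, 2003) with interval shrinkage over a bounded support could look like the following, here applied to a toy Beta(2, 2) target standing in for the posterior over a:

```python
import math
import random

def slice_sample_bounded(log_p, x0, lo, hi, rng):
    """One univariate slice-sampling step (Neal, 2003) for a parameter
    with bounded support (lo, hi): draw a slice height under the density
    at x0, then shrink the interval until a point inside the slice is found.
    `log_p` is the unnormalised log posterior of the parameter."""
    log_y = log_p(x0) + math.log(1.0 - rng.random())  # slice height (log space)
    left, right = lo, hi
    while True:
        x1 = left + rng.random() * (right - left)
        if log_p(x1) > log_y:
            return x1
        if x1 < x0:        # reject: shrink the interval towards x0
            left = x1
        else:
            right = x1

# Toy target: unnormalised Beta(2, 2) density on (0, 1).
log_beta22 = lambda t: math.log(t) + math.log(1.0 - t)
rng = random.Random(0)
x, samples = 0.5, []
for _ in range(2000):
    x = slice_sample_bounded(log_beta22, x, 1e-9, 1.0 - 1e-9, rng)
    samples.append(x)
posterior_mean = sum(samples) / len(samples)  # Beta(2, 2) has mean 0.5
```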

Datasets
Our datasets consist of the adult utterances from two morphologically annotated corpora from CHILDES: an English corpus, Eve (Brown, 1973), and a Spanish corpus, Ornat (Ornat, 1994). Morphology is marked by a grammatical suffix on the stem, e.g. doggy-PL. Words marked with irregular morphology are left unsegmented.
We also use the phonologically encoded Eve dataset used by GGJ. This dataset does not exactly correspond to the orthographic version, due to discrepancies in tokenisation, so we are unable to evaluate it quantitatively.

Results
For each setting, we report the average over three runs of 1000 iterations of Gibbs sampling without annealing, using the last iteration for evaluation.
Table 1 shows what happens when hyperparameters are inferred: ORIG finds a token-based solution, with as many tables as tokens, while DP-CHAR is the opposite, with a small a allowing just over one table for each word type. DP-UNI is between these two extremes. b is consistently between 1 and 3, confirming that it has little effect.
The effect of the hyperparameters can be seen in the morphology results, shown in Table 2. DP-CHAR is robust across hyperparameter values, finding the same type-based solution with fixed and inferred hyperparameters, while the other models have very different results depending on the hyperparameter settings. ORIG with fixed hyperparameters performs best, with the highest VM score (a clustering measure; Rosenberg and Hirschberg, 2007) and a level of segmentation close to the correct one. However, with inferred hyperparameters, this model severely undersegments: it finds the unsegmented maximum-likelihood solution, where all tokens are generated from the stem distribution (Goldwater, 2007).
The models with alternate base distributions go to the other extreme, oversegmenting the corpus. As generating new morphemes becomes less probable, the pressure to find the most compact morpheme lexicon grows. This leads to oversegmentation through many spurious suffixes. The length penalty in DP-CHAR exacerbates this problem, but it can be seen in the DP-UNI solutions as well, particularly when hyperparameters are fixed to encourage a type-based solution.

Conclusion
The base distribution in the original GGJ model assigned a relatively high probability to unseen morphemes, allowing the model to generate new analyses for seen words instead of reusing old analyses, and leading to undersegmented, token-based solutions. The alternative base distributions proposed here were effective in finding type-based solutions. However, these over-segmented solutions clearly do not match the true morphology, indicating that the model structure is inadequate.
One reason may be that the model structure is overly simple. The model is faced with an arguably more difficult task than a human learner, who has access to semantic, syntactic, and phonological cues. Adding these types of information has been shown to help morphology learning in similar models (Johnson, 2008b; Sirts and Goldwater, 2013; Frank et al., 2013).
Similarly, the morphological ambiguity that is captured by a model operating over tokens (and ignored in better-performing models that allow only a single analysis for each word type: Poon et al. (2009); Lee et al. (2011); Sirts and Alumäe (2012)) can often be disambiguated using semantic and syntactic information. A model that generates a single analysis per meaningful (semantically and syntactically distinct) word-form could avoid the potential problems of spurious re-generation seen in the original GGJ model, as well as the converse problem of under-generation in our alternatives. Such a model might also map onto the human lexicon (which demonstrably avoids both problems) in a more realistic way.

Table 2: Final morphology results. 'Fix' refers to models with fixed PYP hyperparameters (a = b = 0.1), while 'Inf' models have inferred hyperparameters. % Seg shows the percentage of tokens that have a non-null suffix, while |L| is the size of the morpheme lexicon. VM is shown with 95% confidence intervals.