Incorporating tone in the calculation of phonotactic probability

This paper investigates how the ordering of tone relative to the segmental string influences the calculation of phonotactic probability. Trigram and recurrent neural network models were trained on syllable lexicons of four Asian syllable-tone languages (Mandarin, Thai, Vietnamese, and Cantonese) in which tone was treated as a segment occurring in different positions in the string. For trigram models, the optimal permutation interacted with language, while neural network models were relatively unaffected by tone position in all languages. In addition to providing a baseline for future evaluation, these results suggest that phonotactic probability is robust to choices of how tone is ordered with respect to other elements in the syllable.


Introduction
The phonotactic probability of a string is an important quantity in several areas of linguistic research, including language acquisition, wordlikeness, word segmentation, and speech production and perception (Bailey and Hahn, 2001; Daland and Pierrehumbert, 2011; Storkel and Lee, 2011; Vitevitch and Luce, 1999). When the language of interest is a tone language, the question arises of how tone should be incorporated into the probability calculation. As phonotactic probability is frequently computed based on some type of n-gram model, this means deciding on which segment(s) the probability of a tone should be conditioned. For instance, using a bigram model, one might compute the probability of the Mandarin syllable fāng as P(a|f) × P(N|a) × P(tone 1|N), but could just as well consider P(tone 1|f) × P(a|tone 1) × P(N|a), or any other conceivable permutation of tone and segments.
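To make the factorization concrete, a bigram score can be computed by the chain rule over whichever ordering of tone and segments one adopts. A minimal sketch, in which the conditional probability function `logp` is a hypothetical stand-in for estimates trained on a lexicon, and boundary symbols follow one common convention:

```python
import math

def bigram_logprob(chars, logp):
    """Log-probability of a character string under a bigram model,
    computed by the chain rule over adjacent symbol pairs.
    Boundary symbols <s> and </s> are a common convention."""
    padded = ["<s>"] + list(chars) + ["</s>"]
    return sum(logp(prev, cur) for prev, cur in zip(padded, padded[1:]))

# The same syllable under two tone orderings: tone after the coda
# vs. tone immediately after the onset (here "1" marks tone 1).
uniform = lambda prev, cur: math.log(0.5)  # toy stand-in model
score_final = bigram_logprob(["f", "a", "N", "1"], uniform)
score_onset = bigram_logprob(["f", "1", "a", "N"], uniform)
```

Under a trained model the two orderings would in general receive different scores, since they condition tone on different neighboring symbols.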
While this issue is occasionally remarked on (e.g. Newman et al., 2011: 246), there remains no widespread consensus in practice. The choice of ordering is sometimes justified based on segment-tone co-occurrence restrictions in the language under study (Myers and Tsay, 2005), but is often presented without justification (Kirby and Yu, 2007; Yang et al., 2018), and in some cases tone is simply ignored (Gong, 2017). When the space of possibilities is considered, researchers generally select the permutation which maximizes model fit to some external data, such as participant judgments of phonological distance (Do and Lai, 2021a) or wordlikeness (Do and Lai, 2021b).
Although extrinsic evaluation is in some sense a gold standard, intrinsic metrics of model fit can also be informative, in part because extrinsic metrics are not always robust across data sets. For instance, participant wordlikeness judgments can vary considerably based on the particulars of the experimental design (Myers and Tsay, 2005; Shademan, 2006; Vitevitch and Luce, 1999), so the treatment of tone that produces a best-fit model for one dataset may not do so for another. The lexicon of a given language is much more internally stable in terms of how segments and tones are distributed, so intrinsic evaluation may provide a useful baseline for reasoning about the treatment of tone relative to segments both within and across languages.
This short paper considers a simple information-theoretic motivation for selecting a permutation: all else being equal, we should prefer a model that maximizes the probability of the lexicon (i.e., minimizes the cross-entropy loss), because this will be the model that by definition does the best job of capturing the phonotactic regularities of the lexicon (Cherry et al., 1953; Goldsmith, 2002; Pimentel et al., 2020). By treating tone as another phone in the segmental string, we can see whether and to what degree this choice has an effect on the overall entropy of the lexicon. Intuitively, any model that can take phonotactic constraints into account will result in a reduction in entropy. Thus, even an n-gram model with a sufficiently large context window should in principle be able to model segment-tone co-occurrences at the syllable level. However, tone languages differ with respect to tone-segment co-occurrence restrictions (see Sec. 2). If a relevant constraint primarily targets syllable onsets, for instance, placing the tonal "segment" in immediate proximity to the onset will increase the probability of the string, even relative to a model capable of capturing the dependency at a longer distance.
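The quantity being minimized can be sketched directly. Assuming a per-string log-probability function (base 2), the per-character cross-entropy and perplexity of a lexicon are computed as follows (a minimal illustration, not the evaluation code used in the study):

```python
def perplexity(lexicon, log2prob):
    """Per-character perplexity of a string set: 2 ** cross-entropy,
    where cross-entropy is the mean negative log2-probability per
    character. Lower perplexity = higher probability of the lexicon."""
    total_neg_logp = -sum(log2prob(s) for s in lexicon)
    n_chars = sum(len(s) for s in lexicon)
    return 2 ** (total_neg_logp / n_chars)
```

For example, a model that assigns every character probability 1/4 (log2 = -2) yields a perplexity of exactly 4 regardless of the lexicon.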

Languages
Four syllable-tone languages were selected for this study: Mandarin Chinese, Cantonese, Vietnamese and Thai. They are partially a convenience sample in that the necessary lexical resources were readily available, but also have some useful similarities: all share a similar syllable structure template and have five or six tones. However, the four languages vary in terms of their segment-tone co-occurrence restrictions, as detailed below.
In all cases, the lexicon was defined as the set of unique syllable shapes in each language. For consistency, the syllable template in all four languages is considered to be (C1)(C2)V(C)T, with variable positioning of T. Offglides were treated as codas in all languages. The syllable lexicons for all four languages are provided in the supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB).
Thai (tha)
A Thai lexicon of 4,133 unique syllables was created based on the dictionary of Haas (1964), which contains around 19,000 entries and 47,000 syllables. The phonemic representation encodes 20 onsets, 3 medials /w l r/, 21 nuclei (vowel length being contrastive in Thai), 8 codas and 5 tones. In Thai, high tone is rare or unattested following unaspirated and voiced onsets, and there is also statistical evidence for a restriction on rising tones with these onsets (Perkins, 2013). In syllables with an obstruent coda (/p t k/), only high, low, or falling tones occur, depending on the length of the nuclear vowel (Morén and Zsiga, 2006).

Vietnamese (vie)
The Vietnamese lexicon of 8,128 syllables was derived from a freely available dictionary of around 74,000 words (Đức, 2004), phonetized using a spelling pronunciation (Kirby, 2008). The resulting representation encodes 24 onsets, 1 medial (/w/), 14 nuclei, 8 codas and 6 tones. Vietnamese syllables ending in the obstruents /p t k/ are restricted to one of just two tones.

Cantonese (yue)
The Cantonese syllabary consists of the 1,884 unique syllables in the Chinese Character Database (Kwan et al., 2003), encoded using the jyutping system. This representation distinguishes 22 onsets, 1 medial (/w/), 11 nuclei, 5 codas and 6 tones. In Cantonese, unaspirated initials do not occur in syllables with low-falling tones, and aspirated initials do not occur with the low tone. Syllables ending with /p t k/ are restricted to one of the three "entering" tones (Yue-Hashimoto, 1972).

Methods
Two classes of character-level language models (LMs) were considered: simple n-gram models and recurrent neural networks (Mikolov et al., 2010). In an n-gram model, the probability of a string w = x_1 x_2 … x_z is the product of the conditional probabilities of the component n-grams:

P(w) = ∏_{i=1}^{z} P(x_i | x_{i−n+1} … x_{i−1})

The degree of context taken into account is thus determined by the value chosen for n. In a recurrent neural network (RNN), the next character in a sequence is predicted using the current character and the previous hidden state. At each step t, the network retrieves an embedding for the current input x_t and combines it with the hidden layer from the previous step to compute a new hidden layer h_t:

h_t = g(W x_t + U h_{t−1})

where W is the weight matrix for the current time step, U the weight matrix for the previous time step, and g is an appropriate nonlinear activation function. This hidden layer h_t is then used to generate an output layer y_t, which is passed through a softmax layer to generate a probability distribution over the entire vocabulary. The probability of a sequence x_1, x_2 … x_z is then just the product of the probabilities of each character in the sequence:

P(x_1 x_2 … x_z) = ∏_{i=1}^{z} P(x_i | x_1 … x_{i−1})

The incorporation of the recurrent connection as part of the hidden layer allows RNNs to avoid the problem of limited context inherent in n-gram models, because the hidden state embodies (some type of) information about all of the preceding characters in the string. Although RNNs cannot capture arbitrarily long-distance dependencies, this is unlikely to make a difference for the relatively short distances involved in phonotactic modeling.
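The hidden-state update can be sketched in plain Python as a toy Elman step with a tanh activation; this is an illustration only, as real models (including the PyTorch implementation used here) operate on batched tensors with learned weights:

```python
import math

def rnn_step(x_t, h_prev, W, U, b):
    """One Elman RNN step: h_t = g(W x_t + U h_prev + b), with g = tanh.
    W maps the current input, U the previous hidden state; vectors and
    matrices are plain Python lists for readability."""
    return [
        math.tanh(
            b[i]
            + sum(W[i][j] * x for j, x in enumerate(x_t))
            + sum(U[i][j] * h for j, h in enumerate(h_prev))
        )
        for i in range(len(b))
    ]
```

Iterating this step over a character sequence is what lets the final hidden state carry information about every preceding symbol, regardless of distance.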
Trigram models were built using the SRILM toolkit (Stolcke, 2002), with maximum likelihood estimates smoothed using interpolated Witten-Bell discounting (Witten and Bell, 1991). RNN LMs were built using PyTorch (Paszke et al., 2019), based on an implementation by Mayer and Nelson (2020). The results reported here make use of simple recurrent networks (Elman, 1990), but similar results were obtained using an LSTM layer (Hochreiter and Schmidhuber, 1997).
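Interpolated Witten-Bell smoothing recursively mixes each n-gram estimate with its lower-order backoff, weighting the backoff by the number of distinct continuation types observed after the history. A minimal pure-Python sketch of the idea (SRILM's implementation differs in details and efficiency; the base case here backs off to a uniform distribution):

```python
from collections import Counter

def train_counts(strings, n=3):
    """Collect k-gram counts for k = 1..n, padding with boundary symbols."""
    counts = [Counter() for _ in range(n)]  # counts[k-1] holds k-grams
    for s in strings:
        chars = ["<s>"] * (n - 1) + list(s) + ["</s>"]
        for i in range(n - 1, len(chars)):
            for k in range(1, n + 1):
                counts[k - 1][tuple(chars[i - k + 1:i + 1])] += 1
    return counts

def wb_prob(counts, ngram, vocab_size):
    """Interpolated Witten-Bell estimate of P(w | history):
    (c(h,w) + T(h) * P_backoff(w)) / (c(h) + T(h)),
    where T(h) is the number of distinct types seen after history h."""
    if not ngram:
        return 1.0 / vocab_size  # uniform base distribution
    k = len(ngram)
    hist = ngram[:-1]
    c_hist = sum(c for g, c in counts[k - 1].items() if g[:-1] == hist)
    types = sum(1 for g in counts[k - 1] if g[:-1] == hist)
    backoff = wb_prob(counts, ngram[1:], vocab_size)
    if c_hist == 0:
        return backoff
    return (counts[k - 1][ngram] + types * backoff) / (c_hist + types)
```

Because the backoff distribution sums to 1 over the vocabulary, the smoothed estimates also form a proper distribution for any history.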

Procedure
The syllables in each lexicon were arranged in 5 distinct permutations: tone following the coda (T|C), nucleus (T|N), medial (T|M), onset (T|O), and with tone as the initial segment in the syllable (T|#). As many syllables in these languages lack onsets, medials, and/or codas, a sizable number of the resulting strings were identical across permutations. Both smoothed trigram and simple RNN LMs were then fit to each permuted lexicon 10 times, with random 80/20 train/dev splits (other splits produced similar results). For each run, the perplexity of the language model on the dev set D = x_1 x_2 … x_N (i.e., the exponentiated cross-entropy) was recorded:

PP(D) = P(x_1 x_2 … x_N)^(−1/N)
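The five permutations can be generated mechanically from a parsed syllable. A sketch using a hypothetical helper (the actual preprocessing depends on each lexicon's transcription):

```python
def permute_tone(onset, medial, nucleus, coda, tone):
    """Return the five tone placements used in the study: after the coda
    (T|C), nucleus (T|N), medial (T|M), onset (T|O), and
    syllable-initially (T|#). Empty strings mark absent positions."""
    o, m, n, c, t = onset, medial, nucleus, coda, tone
    return {
        "T|C": o + m + n + c + t,
        "T|N": o + m + n + t + c,
        "T|M": o + m + t + n + c,
        "T|O": o + t + m + n + c,
        "T|#": t + o + m + n + c,
    }
```

When a position is empty, adjacent placements collapse: for a medial-less syllable like fāng (f + aN + tone 1), T|M and T|O yield the same string, which is why many strings were identical across permutations.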

Results
For brevity, only the main findings are summarized here; the full results are available as part of the online supplementary materials (http://doi.org/10.17605/OSF.IO/NA5FB). Table 1 shows the orderings which minimized perplexity for each method and language, averaged over 10 runs. Differences between orderings were then assessed visually, aided by simple analyses of variance. For the trigram LMs, perplexity was lowest in Mandarin when tones followed codas, while differences in perplexity between other orderings were negligible. For Thai, Vietnamese, and Cantonese, all orderings were roughly comparable except for when tone was ordered as the first segment in the syllable (T|#), which increased perplexity by up to 1 over the mean of the other orderings. In addition, for Thai, the ordering T|M resulted in significantly lower perplexities compared to all other permutations. For the RNN LMs, although T|M was the numerically optimal ordering for three of the four languages, in practical terms permutation had no effect on perplexity, with numerical differences of no greater than 0.1 (see Table 2).

Discussion
Consistent with other recent work in computational phonotactics (e.g. Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020), the neural network models outperformed the trigram baselines by a considerable margin (a reduction in average perplexity of up to 2.5, depending on language). Neural network models were also much less sensitive to the linear position of tone relative to other elements in the segmental string (cf. Do and Lai, 2021b), no doubt because the RNNs' ability to model co-occurrence tendencies within the syllable is not constrained by a fixed context window in the way that n-gram models are. Perhaps as a result, however, the RNN models reveal little about the nature of segment-tone co-occurrence restrictions in any of the languages investigated. In this regard, the trigram models, while clearly less optimal in a global sense, are still informative. The fact that the ordering T|# was significantly worse under the trigram model for Cantonese, Vietnamese, and Thai but not Mandarin can be explained (or predicted) by the fact that of the four languages, only Mandarin does not permit obstruent codas, and consequently has no coda-tone co-occurrence restrictions (indeed, the four primary tones of Mandarin occur with more or less equal type frequency). In the other three languages, syllables with obstruent codas can only bear a restricted set of tones, and in a trigram model this dependency is not modeled when tone is prepended to the syllable, since the coda will then frequently, though not always, fall outside the window visible to the language model. Even a model with a context window large enough to capture such dependencies will assign the lexicon a higher perplexity when it is structured in this way.
The finding that the T|M ordering is always optimal in Thai (and by a larger margin than in the other languages) is presumably due to the fact that the distribution of the medials /w l r/ is severely restricted in this language, occurring only after /p pʰ t tʰ k kʰ f/. The distribution of tones after onset-medial clusters is inherently more constrained and therefore more predictable. A similar restriction holds in Cantonese, albeit to a lesser degree (the medial /w/ only occurs with the onsets /k/ and /kʰ/).

Shortcomings and extensions
This work did not explore representations based on phonological features, given that their incorporation has failed to provide evaluative improvements in other studies of computational phonotactics (Mayer and Nelson, 2020; Mirea and Bicknell, 2019; Pimentel et al., 2020). However, feature-based approaches can be theoretically insightful and may even prove necessary for other quantifications, such as measures of phonological distance where tone is involved (Do and Lai, 2021a).
The present study has focused on a small sample of structurally and typologically similar languages. All have relatively simple syllable structures in which one and only one tone is associated with each syllable. Not all tone languages share these properties, however. In so-called "word-tone" languages, such as Japanese or Shanghainese, the surface tone with which a given syllable is realized is frequently not lexically specified. In other languages, such as Yoloxóchitl Mixtec (DiCanio et al., 2014), tonal specification may be tied to sub-syllabic units, such as the mora. Finally, data from many other languages, such as Kukuya (Hyman, 1987), make it clear that at least in some cases tones can only be treated in terms of abstract melodies, which do not have a consistent association to syllables, moras, or vowels (Goldsmith, 1976). In these and many other cases, careful consideration of the theoretical motivations justifying a particular representation is required before it makes sense to consider ordering effects.
However, to the extent that it is possible to generate a segmental representation of a tone language in which surface tones are indicated, what the present work suggests is that the precise ordering of the tonal symbols with respect to other symbols in the string is unlikely to have a significant impact on phonotactic probability. This follows from two assumptions (or constraints): first, that the set of symbols used to indicate tones is distinct from those used to indicate the vowels and consonants; and second, that one and only one such tone symbol appears per string domain (here, the syllable). If these two constraints hold, the complexity of the syllable template should in general have a greater impact on the entropy of the string set than the position of the tone symbol, although the number of unique tone symbols relative to the number of segmental symbols may also have an effect. According to Maddieson (2013) and Easterday (2019), languages with complex syllable structures (defined as those permitting fairly free combinations of two or more consonants in the position before a vowel, and/or two or more consonants in the position after the vowel) rarely have complex tone systems, or indeed tone systems at all, so this is unlikely to be an issue for most tone languages.
One possibility the present work did not address is whether it is even necessary, or desirable, to include tone in phonotactic probability calculations in the first place. The probability of the lexicon of a tonal language would surely change if tone is ignored, but whether listeners' judgments of a sequence as well-or ill-formed is better predicted by a model that takes tone into account vs. one that does not is an empirical question (but see Kirby and Yu, 2007;Do and Lai, 2021b for some evidence that it may not). Similarly, for research questions focused on tone sandhis, or on the distributions of the tonal sequences themselves (tonotactics), the relevant computations will be restricted to the tonal tier in the first instance, and ordering with respect to segments may simply not be relevant (but see Goldsmith and Riggle, 2012).
Finally, the present study has focused on the lexical representation of tone, but in many languages tone primarily serves a morphological function. The SIGMORPHON 2020 Task 0 shared challenge (Vylomova et al., 2020) included inflection data from several tonal Oto-Manguean languages in which tone was orthographically encoded in different ways via string diacritics. While the authors noted the existence of these differences, it is unclear whether and to what extent the different representations of tones affected system performance. Similarly, the potential impact of tone ordering relative to other elements in the string has yet to be systematically investigated in this setting.

Conclusion
This paper has assessed how different permutations of tone and segments affect the perplexity of the lexicon in four syllable-tone languages using two types of phonotactic language models, an interpolated trigram model and a simple recurrent neural network. The perplexities assigned by the neural network models were essentially unaffected by different choices of ordering; while the trigram model was more sensitive to permutations of tone and segments, the effects on perplexity remained minimal. In addition to providing a baseline for future evaluation, these results suggest that the phonotactic probability of a syllable is relatively robust to the choice of how tone is ordered with respect to other elements in the string, especially when using a model capable of encoding dependencies across the entire syllable.