Phonotactic Complexity and Its Trade-offs

Tiago Pimentel, Brian Roark, Ryan Cotterell


Abstract
We present methods for calculating a measure of phonotactic complexity, bits per phoneme, that permits a straightforward cross-linguistic comparison. When given a word, represented as a sequence of phonemic segments such as symbols in the International Phonetic Alphabet, and a statistical model trained on a sample of word types from the language, we can approximately measure bits per phoneme using the negative log-probability of that word under the model. This simple measure allows us to compare the entropy across languages, giving insight into how complex a language’s phonotactics is. Using a collection of 1016 basic concept words across 106 languages, we demonstrate a very strong negative correlation of −0.74 between bits per phoneme and the average length of words.
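To make the measure concrete, the following is a minimal sketch of computing bits per phoneme from a word's negative log-probability. It is not the authors' implementation: it assumes an add-alpha smoothed bigram model over phonemes as a stand-in for the statistical model, and it assumes the end-of-word symbol is counted as one prediction when normalizing. The names `train_bigram`, `bits_per_phoneme`, `BOW`, and `EOW` are hypothetical, and the three toy words are invented for illustration.

```python
import math
from collections import Counter

BOW = "<w>"   # hypothetical beginning-of-word marker (context only)
EOW = "</w>"  # hypothetical end-of-word marker (predicted like a phoneme)


def train_bigram(words, alpha=1.0):
    """Fit an add-alpha smoothed bigram model on word types.

    `words` is an iterable of phoneme sequences (tuples of strings).
    Returns a function giving log2 p(cur | prev).
    """
    bigrams = Counter()
    contexts = Counter()
    vocab = {EOW}  # symbols that can be predicted
    for word in words:
        seq = [BOW, *word, EOW]
        vocab.update(word)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    vocab_size = len(vocab)

    def log2_prob(prev, cur):
        numerator = bigrams[prev, cur] + alpha
        denominator = contexts[prev] + alpha * vocab_size
        return math.log2(numerator / denominator)

    return log2_prob


def bits_per_phoneme(word, log2_prob):
    """Negative log2-probability of `word`, averaged over its predictions.

    Assumption: the end-of-word symbol counts as one prediction,
    so the total is divided by len(word) + 1.
    """
    seq = [BOW, *word, EOW]
    total_bits = -sum(log2_prob(p, c) for p, c in zip(seq, seq[1:]))
    return total_bits / (len(word) + 1)


# Toy sample of word types, invented for illustration only.
sample = [("k", "a"), ("k", "a", "t"), ("t", "a")]
log2_prob = train_bigram(sample)
for word in [("k", "a"), ("a", "k")]:
    print("".join(word), round(bits_per_phoneme(word, log2_prob), 3))
```

Under this sketch, a word that follows patterns seen in the sample receives fewer bits per phoneme than an equally long word that violates them. Averaging the value over held-out words would give an estimate of the cross-entropy for the language under the chosen model.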
Anthology ID:
2020.tacl-1.1
Volume:
Transactions of the Association for Computational Linguistics, Volume 8
Year:
2020
Address:
Cambridge, MA
Editors:
Mark Johnson, Brian Roark, Ani Nenkova
Venue:
TACL
Publisher:
MIT Press
Pages:
1–18
URL:
https://aclanthology.org/2020.tacl-1.1
DOI:
10.1162/tacl_a_00296
Cite (ACL):
Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic Complexity and Its Trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.
Cite (Informal):
Phonotactic Complexity and Its Trade-offs (Pimentel et al., TACL 2020)
PDF:
https://aclanthology.org/2020.tacl-1.1.pdf
Code:
tpimentelms/phonotactic-complexity