On Hapax Legomena and Morphological Productivity

Janet Pierrehumbert, Ramon Granell


Abstract
Quantifying and predicting morphological productivity is a long-standing challenge in corpus linguistics and psycholinguistics. The same challenge reappears in natural language processing in the context of handling words that were not seen in the training set (out-of-vocabulary, or OOV, words). Prior research showed that a good indicator of the productivity of a morpheme is the number of words involving it that occur exactly once (the hapax legomena). A technical connection was adduced between this result and Good-Turing smoothing, which assigns probability mass to unseen events on the basis of the simplifying assumption that word frequencies are stationary. In a large-scale study of 133 affixes in Wikipedia, we develop evidence that success in fact depends on tapping the frequency range in which the assumptions of Good-Turing are violated.
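As a rough illustration of the two quantities the abstract mentions, the sketch below counts hapax legomena per affix and computes the basic Good-Turing estimate of unseen probability mass (the number of types seen once divided by the number of tokens). It is a minimal sketch and not the authors' pipeline: the function names, the toy token list, and the naive string-suffix matching are all assumptions made for illustration. A real study would need proper morphological segmentation.

```python
from collections import Counter


def good_turing_unseen_mass(tokens):
    """Basic Good-Turing estimate of the mass of unseen types: N1 / N.

    N1 is the number of types that occur exactly once (hapax legomena),
    N is the total number of tokens.
    """
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_hapax / n_tokens if n_tokens else 0.0


def hapax_by_suffix(tokens, suffixes):
    """Count hapax legomena per suffix.

    Assumption: a word "involves" a suffix if it ends with that string and
    is longer than the suffix itself. This is a hypothetical stand-in for
    real morphological analysis.
    """
    counts = Counter(tokens)
    result = {}
    for suffix in suffixes:
        result[suffix] = sum(
            1
            for word, c in counts.items()
            if c == 1 and word.endswith(suffix) and len(word) > len(suffix)
        )
    return result


# Toy corpus, invented for illustration only.
tokens = "kindness kindness darkness sadness happiness walker walker baker".split()
unseen_mass = good_turing_unseen_mass(tokens)
hapax_counts = hapax_by_suffix(tokens, ["ness", "er"])
```

On this toy input, four of the eight tokens are hapax legomena, and "ness" has more hapax types than "er", which under the prior research cited in the abstract would suggest that "ness" is the more productive suffix.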
Anthology ID:
W18-5814
Volume:
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Sandra Kuebler, Garrett Nicolai
Venue:
EMNLP
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
125–130
URL:
https://aclanthology.org/W18-5814/
DOI:
10.18653/v1/W18-5814
Cite (ACL):
Janet Pierrehumbert and Ramon Granell. 2018. On Hapax Legomena and Morphological Productivity. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 125–130, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
On Hapax Legomena and Morphological Productivity (Pierrehumbert & Granell, EMNLP 2018)
PDF:
https://aclanthology.org/W18-5814.pdf