Simon Todd


2023

pdf bib
Unsupervised part-of-speech induction for language description: Modeling documentation materials in Kolyma Yukaghir
Albert Ventayol-boada | Nathan Roll | Simon Todd
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

This study investigates the clustering of words into Part-of-Speech (POS) classes in Kolyma Yukaghir. In grammatical descriptions, lexical items are assigned to POS classes based on their morphological paradigms. Discursively, however, these classes share a fair amount of morphology. In this study, we turn to POS induction to evaluate if classes based on quantification of the distributions in which roots and affixes are used can be useful for language description purposes, and, if so, what those classes might be. We qualitatively compare clusters of roots and affixes based on four different definitions of their distributions. The results show that clustering is more reliable for words that typically bear more morphology. Additionally, the results suggest that the number of POS classes in Kolyma Yukaghir might be smaller than stated in current descriptions. This study thus demonstrates how unsupervised learning methods can provide insights for language description, particularly for highly inflectional languages.

pdf bib
PSST! Prosodic Speech Segmentation with Transformers
Nathan Roll | Calbert Graham | Simon Todd
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

We develop and probe a model for detecting the boundaries of prosodic chunks in untranscribed conversational English speech. The model is obtained by fine-tuning a Transformer-based speech-to-text (STT) model to integrate the identification of Intonation Unit (IU) boundaries with the STT task. The model shows robust performance, both on held-out data and on out-of-distribution data representing different dialects and transcription protocols. By evaluating the model on degraded speech data, and comparing it with alternatives, we establish that it relies heavily on lexico-syntactic information inferred from audio, and not solely on acoustic information typically understood to cue prosodic structure. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.

2022

pdf bib
Unsupervised morphological segmentation in a language with reduplication
Simon Todd | Annie Huang | Jeremy Needle | Jennifer Hay | Jeanette King
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present an extension of the Morfessor Baseline model of unsupervised morphological segmentation (Creutz and Lagus, 2007) that incorporates abstract templates for reduplication, a typologically common but computationally underaddressed process. Through a detailed investigation that applies the model to Maori, the ̄ Indigenous language of Aotearoa New Zealand, we show that incorporating templates improves Morfessor’s ability to identify instances of reduplication, and does so most when there are multiple minimally-overlapping templates. We present an error analysis that reveals important factors to consider when applying the extended model and suggests useful future directions.