Ryan Soh-Eun Shim


2024

Phonotactic Complexity across Dialects
Ryan Soh-Eun Shim | Kalvin Chang | David R. Mortensen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Received wisdom in linguistic typology holds that if the structure of a language becomes more complex in one dimension, it will simplify in another, building on the assumption that all languages are equally complex (Joseph and Newmeyer, 2012). We study this claim on a micro-level, using a tightly controlled sample of Dutch dialects (across 366 collection sites) and Min dialects (across 60 sites), which enables a fairer comparison across varieties. Even at the dialect level, we find empirical evidence for a tradeoff between word length and a computational measure of phonotactic complexity from an LSTM-based phone-level language model, a result previously documented only at the language level. A generalized additive model (GAM) shows that dialects with low phonotactic complexity concentrate around the capital regions, which we hypothesize is consistent with prior claims that language varieties with larger or more diverse populations show reduced phonotactic complexity. We also experiment with incorporating the auxiliary task of predicting syllable constituency, but do not find an increase in the strength of the negative correlation observed.

2022

dialectR: Doing Dialectometry in R
Ryan Soh-Eun Shim | John Nerbonne
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

We present dialectR, an open-source R package for performing quantitative analyses of dialects based on categorical measures of difference and on variants of edit distance. dialectR stands as one of the first programmable toolkits that users may freely combine and extend with further statistical procedures. We describe implementation details of the package, and provide two examples of its use: one performing analyses based on multidimensional scaling and hierarchical clustering on a dataset of Dutch dialects, and another showing how an approximation of the acoustic vowel space may be achieved by computing an acoustic distance based on MFCCs (Mel-Frequency Cepstral Coefficients) on audio recordings of vowels.

SIGMORPHON 2022 Task 0 Submission Description: Modelling Morphological Inflection with Data-Driven and Rule-Based Approaches
Tatiana Merzhevich | Nkonye Gbadegoye | Leander Girrbach | Jingwen Li | Ryan Soh-Eun Shim
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes our participation in the 2022 SIGMORPHON-UniMorph Shared Task on Typologically Diverse and Acquisition-Inspired Morphological Inflection Generation. We present two approaches: one being a modification of the neural baseline encoder-decoder model, the other being hand-coded morphological analyzers using finite-state tools (FST) and outside linguistic knowledge. While our proposed modification of the baseline encoder-decoder model underperforms the baseline for almost all languages, the FST methods outperform other systems in the respective languages by a large margin. This confirms that purely data-driven approaches have not yet reached the maturity to replace trained linguists for documentation and analysis, especially for low-resource and endangered languages.