Jin-seo Kim


2024

KoFREN: Comprehensive Korean Word Frequency Norms Derived from Large Scale Free Speech Corpora
Jin-seo Kim | Anna Seo Gyeong Choi | Sunghye Cho
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word frequencies are integral to linguistic studies, showing strong correlations with speakers’ cognitive abilities and other important linguistic parameters, including the Age of Acquisition (AoA). However, the formulation of credible Korean word frequency norms has been obstructed by the lack of expansive speech data and a reliable part-of-speech (POS) tagger. In this study, we unveil Korean word frequency norms (KoFREN), derived from large-scale spontaneous speech corpora (41 million words) that include a balanced representation of gender and age. We employed a machine learning-powered POS tagger, showcasing accuracy on par with human annotators. Our frequency norms correlate significantly with external studies’ lexical decision time (LDT) and AoA measures. KoFREN also aligns with English counterparts sourced from SUBTLEX_US, an English word frequency measure that has been frequently used in the literature. KoFREN is poised to facilitate research in spontaneous Contemporary Korean and can be utilized in many fields, including clinical studies of Korean patients.
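For readers who want to experiment with frequency norms of this kind, the sketch below shows one conventional way to turn raw token counts into per-million-word frequencies and correlate them with an external AoA norm. This is a minimal illustration, not the authors' pipeline: the lemma counts and AoA values are hypothetical placeholders, and the Zipf transform is a common convention in the frequency-norm literature rather than something the abstract specifies.

```python
# Minimal sketch (not the KoFREN pipeline): per-million-word frequencies
# from raw POS-tagged token counts, plus a Spearman correlation with an
# external Age-of-Acquisition (AoA) norm. `token_counts` and `aoa_norms`
# are hypothetical placeholders.
from collections import Counter
from math import log10

from scipy.stats import spearmanr

token_counts = Counter({"사과": 1200, "학교": 9800, "혁신": 310})
corpus_size = 41_000_000  # total word tokens, as reported for KoFREN


def per_million(count: int, total: int = corpus_size) -> float:
    """Frequency per million words, the unit SUBTLEX-style norms use."""
    return count / total * 1_000_000


def zipf(count: int, total: int = corpus_size) -> float:
    """Zipf scale (log10 frequency per billion words); a common transform
    for frequency norms, assumed here rather than stated in the abstract."""
    return log10((count + 1) / (total + 1) * 1_000_000_000)


freq_norms = {w: per_million(c) for w, c in token_counts.items()}

# Hypothetical external AoA measures (in years) for the same lemmas
aoa_norms = {"사과": 3.1, "학교": 4.0, "혁신": 11.5}

shared = sorted(set(freq_norms) & set(aoa_norms))
rho, p = spearmanr([freq_norms[w] for w in shared],
                   [aoa_norms[w] for w in shared])
# Expect a negative rho: more frequent words tend to be acquired earlier
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```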

Crosslinguistic Acoustic Feature-based Dementia Classification Using Advanced Learning Architectures
Anna Seo Gyeong Choi | Jin-seo Kim | Seo-hee Kim | Min Seok Back | Sunghye Cho
Proceedings of the Fifth Workshop on Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments @LREC-COLING 2024

In this study, we rigorously evaluated eight machine learning and deep learning classifiers for identifying Alzheimer’s Disease (AD) patients using crosslinguistic acoustic features automatically extracted from one-minute oral picture descriptions produced by speakers of American English, Korean, and Mandarin Chinese. We employed eGeMAPSv2 and ComParE feature sets on segmented and non-segmented audio data. The Multilayer Perceptron model showed the highest performance, achieving an accuracy of 83.54% and an AUC of 0.8 on the ComParE features extracted from non-segmented picture description data. Our findings suggest that classifiers trained with acoustic features extracted from one-minute picture description data in multiple languages are highly promising as a quick, language-universal, large-scale, remote screening tool for AD. However, the dataset included predominantly English-speaking participants, indicating the need for more balanced multilingual datasets in future research.
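To make the described setup concrete, the sketch below extracts eGeMAPSv02 functionals with the openSMILE Python package and trains a small multilayer perceptron, echoing the feature sets and best-performing model family reported above. It is an illustration under stated assumptions, not the authors' code: the synthetic stand-in data, labels, and hyperparameters are hypothetical, and a real experiment would use the actual one-minute recordings.

```python
# Minimal sketch (not the authors' pipeline): openSMILE acoustic features
# plus a scikit-learn MLP classifier for binary AD-vs-control screening.
import numpy as np
import opensmile
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# One fixed-length vector of acoustic functionals per recording
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,   # or FeatureSet.ComParE_2016
    feature_level=opensmile.FeatureLevel.Functionals,
)
# With real one-minute picture descriptions you would build the matrix as:
#   X = np.vstack([smile.process_file(f).to_numpy().ravel() for f in wav_files])

# Self-contained stand-in: random "features" with eGeMAPSv02's 88 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 88))
y = rng.integers(0, 2, size=80)  # 1 = AD, 0 = healthy control (hypothetical)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = make_pipeline(
    StandardScaler(),  # MLPs are sensitive to feature scale
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
)
clf.fit(X_tr, y_tr)

print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```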