Beyond Distribution: Investigating Language Models’ Understanding of Sino-Korean Morphemes

Taehee Jeon


Abstract
We investigate whether Transformer-based language models, trained solely on Hangul text, can learn the compositional morphology of Sino-Korean (SK) morphemes, which are fundamental to Korean vocabulary. Using BERT_BASE and fastText, we conduct controlled experiments with target words and their “real” vs. “fake” neighbors—pairs that share a Hangul syllable representing the same SK morpheme vs. those that share only the Hangul syllable. Our results show that while both models—especially BERT—distinguish real and fake pairs to some extent, their performance is primarily driven by the frequency of each experimental word rather than a true understanding of SK morphemes. These findings highlight the limits of distributional learning for morpheme-level understanding and emphasize the need for explicit morphological modeling or Hanja-aware strategies to improve semantic representation in Korean language models. Our dataset and analysis code are available at: https://github.com/taeheejeon22/ko-skmorph-lm.
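As a rough illustration of the experimental comparison described in the abstract (this is not the paper's released code; see the GitHub repository above for the actual implementation), the sketch below embeds each word with a Korean BERT_BASE checkpoint and compares the cosine similarity of a target word with a "real" neighbor (shared Sino-Korean morpheme) vs. a "fake" neighbor (shared Hangul syllable only). The model name, pooling strategy, and word triple are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): embed single words
# with a Korean BERT_BASE model and compare real vs. fake neighbor similarity.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "klue/bert-base"  # assumption: any Korean BERT_BASE checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

def embed(word: str) -> torch.Tensor:
    """Mean-pool the last hidden states over the word's subword tokens."""
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden[0, 1:-1].mean(dim=0)  # drop [CLS]/[SEP], average subwords

# Illustrative triple: target 학교 (學校 'school'), real neighbor 학생
# (學生 'student', shares the SK morpheme 學), fake neighbor 학대
# (虐待 'abuse', shares only the Hangul syllable 학).
target, real, fake = "학교", "학생", "학대"

sim = torch.nn.functional.cosine_similarity
real_sim = sim(embed(target), embed(real), dim=0).item()
fake_sim = sim(embed(target), embed(fake), dim=0).item()
print(f"cos(target, real) = {real_sim:.3f}  cos(target, fake) = {fake_sim:.3f}")
```

Under the abstract's finding, a gap between the two similarities would not by itself demonstrate morpheme understanding: the paper reports that such differences are driven primarily by word frequency, so a frequency-controlled comparison is needed.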
Anthology ID:
2025.findings-emnlp.569
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10762–10772
URL:
https://aclanthology.org/2025.findings-emnlp.569/
Cite (ACL):
Taehee Jeon. 2025. Beyond Distribution: Investigating Language Models’ Understanding of Sino-Korean Morphemes. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10762–10772, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Beyond Distribution: Investigating Language Models’ Understanding of Sino-Korean Morphemes (Jeon, Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.569.pdf
Checklist:
https://aclanthology.org/2025.findings-emnlp.569.checklist.pdf