Morpheme Matters: Morpheme-Based Subword Tokenization for Korean Language Models

DongHyeok Lee; Jeongyeon Park; Kyungbeen Cho; Jae Sung Lee

Abstract

Tokenization plays a crucial role in the performance of language models. However, most existing tokenizers rely on frequency-based segmentation, which fails to capture the morphological structure of languages and often leads to inefficient token representations. In this study, we propose a novel tokenization method that emphasizes the importance of Korean morphological structures in eojeol (Korean spacing unit). This method is designed to accommodate both inter-eojeol segmentation and intra-eojeol segmentation, enabling the selection of subwords based on morphemes. We pretrained a language model using the proposed method and evaluated its performance on Korean benchmark tasks. Experimental results demonstrate that the proposed method generally outperforms existing approaches. Notably, it produces significantly fewer tokens per input sequence, indicating its effectiveness and efficiency for Korean language modeling. The code is available at https://github.com/Dohy-Lee/mob.

Anthology ID: 2026.eacl-short.22
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 297–306
URL: https://aclanthology.org/2026.eacl-short.22/
Cite (ACL): DongHyeok Lee, Jeongyeon Park, Kyungbeen Cho, and Jae Sung Lee. 2026. Morpheme Matters: Morpheme-Based Subword Tokenization for Korean Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 297–306, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Morpheme Matters: Morpheme-Based Subword Tokenization for Korean Language Models (Lee et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-short.22.pdf
Checklist: 2026.eacl-short.22.checklist.pdf
