Break it Down into BTS: Basic, Tiniest Subword Units for Korean

Nayeon Kim, Jun-Hyung Park, Joon-Young Choi, Eojin Jeon, Youjin Kang, SangKeun Lee


Abstract
We introduce Basic, Tiniest Subword (BTS) units for the Korean language, which are inspired by the invention principle of Hangeul, the Korean writing system. Instead of relying on 51 Korean consonant and vowel letters, we form the letters from BTS units by adding strokes or combining them. To examine the impact of BTS units on Korean language processing, we develop a novel BTS-based word embedding framework that is readily applicable to various models. Our experiments reveal that BTS units significantly improve the performance of Korean word embedding on all intrinsic and extrinsic tasks in our evaluation. In particular, BTS-based word embedding outperforms the state-of-the-art Korean word embedding by 11.8% in word analogy. We further investigate the unique advantages provided by BTS units through in-depth analysis.
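To make the idea of breaking letters into smaller units concrete, here is a minimal Python sketch. The first step, splitting a precomposed Hangul syllable into its consonant and vowel letters, uses the standard Unicode arithmetic for the Hangul Syllables block. The second step is only an illustration: the `ILLUSTRATIVE_PARTS` table, the `<stroke>` marker, and the function names are assumptions made for this example and are not the BTS inventory or the framework defined in the paper.

```python
# Standard Unicode layout of precomposed Hangul syllables (U+AC00..U+D7A3).
HANGUL_BASE = 0xAC00
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"      # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
             "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
             "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 finals, "" = none

# Hypothetical, partial table for illustration only (NOT the paper's BTS units):
# a letter is written as a simpler letter plus a stroke, or as a combination.
ILLUSTRATIVE_PARTS = {
    "ㅋ": ["ㄱ", "<stroke>"],  # stroke addition
    "ㄲ": ["ㄱ", "ㄱ"],        # combination of consonants
    "ㅐ": ["ㅏ", "ㅣ"],        # combination of vowels
    "ㅘ": ["ㅗ", "ㅏ"],
}


def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into its letters."""
    code = ord(ch) - HANGUL_BASE
    if not 0 <= code < 19 * 21 * 28:
        return (ch,)  # not a precomposed Hangul syllable: pass through
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    parts = (CHOSEONG[cho], JUNGSEONG[jung])
    if jong:
        parts += (JONGSEONG[jong],)
    return parts


def to_units(word):
    """Expand each letter with the illustrative table; unknown letters stay as-is."""
    units = []
    for ch in word:
        for letter in decompose_syllable(ch):
            units.extend(ILLUSTRATIVE_PARTS.get(letter, [letter]))
    return units


print(decompose_syllable("한"))  # letters of one syllable
print(to_units("깨"))            # letters further split by the illustrative table
```

A sequence of such units could then feed a subword-based embedding model in place of letter sequences; how the paper actually defines and uses its units is described in the PDF linked below.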
Anthology ID:
2022.emnlp-main.472
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7007–7024
URL:
https://aclanthology.org/2022.emnlp-main.472
DOI:
10.18653/v1/2022.emnlp-main.472
Cite (ACL):
Nayeon Kim, Jun-Hyung Park, Joon-Young Choi, Eojin Jeon, Youjin Kang, and SangKeun Lee. 2022. Break it Down into BTS: Basic, Tiniest Subword Units for Korean. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7007–7024, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Break it Down into BTS: Basic, Tiniest Subword Units for Korean (Kim et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.472.pdf