An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung


Abstract
Tokenization is typically the first step in text processing. Because a token serves as an atomic unit that embeds the contextual information of text, how a token is defined plays a decisive role in model performance. Although Byte Pair Encoding (BPE) is considered the de facto standard tokenization method owing to its simplicity and universality, it remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies to answer our primary research question: “What is the best tokenization strategy for Korean NLP tasks?” Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best for Korean to/from English machine translation and for natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective. Our code and pre-trained models are publicly available at https://github.com/kakaobrain/kortok.
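
The hybrid strategy named in the abstract (morphological segmentation followed by BPE) can be illustrated with a minimal sketch. This is not the authors' kortok implementation: the morphological analyzer is replaced by hand-segmented toy input, the function names (`learn_bpe`, `merge_pair`, `apply_bpe`, `hybrid_tokenize`) are hypothetical, and the BPE learner is a simplified character-level version. The point it shows is that BPE merges are learned and applied inside each morpheme, so a subword never crosses a morpheme boundary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(units, num_merges):
    """Learn up to `num_merges` BPE merges from units (here: morphemes),
    each treated as a sequence of characters."""
    vocab = Counter(tuple(u) for u in units)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(unit, merges):
    """Segment one unit by replaying the learned merges in order."""
    symbols = list(unit)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


def hybrid_tokenize(morphemes, merges):
    """Apply BPE to each morpheme separately, so no token spans two morphemes."""
    tokens = []
    for morpheme in morphemes:
        tokens.extend(apply_bpe(morpheme, merges))
    return tokens


# Toy training data: morphemes as a morphological analyzer might emit them.
# The segmentation is hand-written for illustration (an assumption).
training_morphemes = ["학교", "학교", "학교", "학생", "에서", "에서", "에", "는"]
merges = learn_bpe(training_morphemes, num_merges=2)

# A pre-segmented input sequence (again hand-written, hypothetical).
tokens = hybrid_tokenize(["학교", "에서", "학생", "는"], merges)
print(merges)
print(tokens)
```

In this sketch, swapping the hand-written morpheme lists for the output of a real Korean morphological analyzer, and the toy learner for a production BPE trainer, would be the natural next step; which tools the paper actually uses is not stated in the abstract.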
Anthology ID:
2020.aacl-main.17
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Kam-Fai Wong, Kevin Knight, Hua Wu
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
133–142
URL:
https://aclanthology.org/2020.aacl-main.17
Cite (ACL):
Kyubyong Park, Joohong Lee, Seongbo Jang, and Dawoon Jung. 2020. An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 133–142, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks (Park et al., AACL 2020)
PDF:
https://aclanthology.org/2020.aacl-main.17.pdf
Code:
kakaobrain/kortok
Data:
KorNLI, KorSTS, PAWS-X, SQuAD