OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation

Sangwhan Moon, Won Ik Cho, Hye Joo Han, Naoaki Okazaki, Nam Soo Kim


Abstract
Korean is a language with complex morphology that uses spaces at larger-than-word boundaries, unlike other East-Asian languages. While morpheme-based text generation can provide significant semantic advantages compared to commonly used character-level approaches, Korean morphological analyzers only provide a sequence of morpheme-level tokens, losing information in the tokenization process. Two crucial issues are the loss of spacing information and subcharacter level morpheme normalization, both of which make the tokenization result challenging to reconstruct the original input string, deterring the application to generative tasks. As this problem originates from the conventional scheme used when creating a POS tagging corpus, we propose an improvement to the existing scheme, which makes it friendlier to generative tasks. On top of that, we suggest a fully-automatic annotation of a corpus by leveraging public analyzers. We vote the surface and POS from the outcome and fill the sequence with the selected morphemes, yielding tokenization with a decent quality that incorporates space information. Our scheme is verified via an evaluation done on an external corpus, and subsequently, it is adapted to Korean Wikipedia to construct an open, permissive resource. We compare morphological analyzer performance trained on our corpus with existing methods, then perform an extrinsic evaluation on a downstream task.
Anthology ID:
2022.lrec-1.531
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
4975–4983
Language:
URL:
https://aclanthology.org/2022.lrec-1.531
DOI:
Bibkey:
Cite (ACL):
Sangwhan Moon, Won Ik Cho, Hye Joo Han, Naoaki Okazaki, and Nam Soo Kim. 2022. OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4975–4983, Marseille, France. European Language Resources Association.
Cite (Informal):
OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation (Moon et al., LREC 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.lrec-1.531.pdf