Developing Language Resources and NLP Tools for the North Korean Language

Arda Akdemir, Yeojoo Jeon, Tetsuo Shibuya


Abstract
In the 70 years since the division of Korea, the two Korean languages have diverged significantly. However, because linguistic resources for the North Korean language are scarce, no DPRK-based language model exists. Consequently, scholars rely on Korean language models trained on South Korean linguistic data. In this paper, we first present a large-scale dataset for the North Korean language and use it to train a BERT-based language model, DPRK-BERT. Second, we annotate a subset of this dataset for the sentiment analysis task. Finally, we compare the performance of different language models on the masked language modeling and sentiment analysis tasks.
Anthology ID:
2022.lrec-1.600
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5595–5600
URL:
https://aclanthology.org/2022.lrec-1.600
Cite (ACL):
Arda Akdemir, Yeojoo Jeon, and Tetsuo Shibuya. 2022. Developing Language Resources and NLP Tools for the North Korean Language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5595–5600, Marseille, France. European Language Resources Association.
Cite (Informal):
Developing Language Resources and NLP Tools for the North Korean Language (Akdemir et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.600.pdf