Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition

Hwichan Kim, Tosho Hirasawa, Mamoru Komachi


Abstract
The primary limitation of North Korean to English translation is the lack of a parallel corpus; therefore, high translation accuracy cannot be achieved. To address this problem, we propose a zero-shot approach using South Korean data, which are remarkably similar to North Korean data. We train a neural machine translation model after tokenizing South Korean text at the character level and decomposing the characters into phonemes. We demonstrate that our method can effectively learn North Korean to English translation and improves the BLEU score by 1.01 points over the baseline.
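The preprocessing described in the abstract (character tokenization followed by phoneme decomposition) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `char_tokenize`, `decompose`, and `to_phonemes` are hypothetical, and Unicode NFD normalization is assumed here as one way to split a precomposed Hangul syllable into its initial consonant, vowel, and optional final consonant. The example pair 로동 / 노동 ("labor") is a commonly cited spelling difference between North and South Korean and is not taken from this page.

```python
import unicodedata

# Precomposed Hangul syllables occupy U+AC00..U+D7A3 in Unicode.
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3


def char_tokenize(sentence):
    """Split a sentence into single-character tokens (hypothetical helper)."""
    return list(sentence)


def decompose(token):
    """Split one Hangul syllable into conjoining jamo via NFD.

    Assumption: NFD stands in for the paper's phoneme decomposition.
    Tokens that are not precomposed Hangul syllables are returned unchanged.
    """
    if len(token) == 1 and HANGUL_FIRST <= ord(token) <= HANGUL_LAST:
        return list(unicodedata.normalize("NFD", token))
    return [token]


def to_phonemes(sentence):
    """Character-tokenize a sentence, then decompose each character."""
    phonemes = []
    for token in char_tokenize(sentence):
        phonemes.extend(decompose(token))
    return phonemes


if __name__ == "__main__":
    north = to_phonemes("로동")  # North Korean spelling
    south = to_phonemes("노동")  # South Korean spelling
    mismatches = sum(a != b for a, b in zip(north, south))
    print(len(north), len(south), mismatches)
```

At the character level the two spellings share only one of two tokens, whereas after decomposition they differ in a single phoneme out of five, which suggests why a model trained on decomposed South Korean text may transfer better to North Korean input.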
Anthology ID:
2020.acl-srw.11
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Month:
July
Year:
2020
Address:
Online
Editors:
Shruti Rijhwani, Jiangming Liu, Yizhong Wang, Rotem Dror
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
72–78
URL:
https://aclanthology.org/2020.acl-srw.11
DOI:
10.18653/v1/2020.acl-srw.11
Cite (ACL):
Hwichan Kim, Tosho Hirasawa, and Mamoru Komachi. 2020. Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 72–78, Online. Association for Computational Linguistics.
Cite (Informal):
Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition (Kim et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-srw.11.pdf
Video:
http://slideslive.com/38928635