Japanese Rule-based Grapheme-to-phoneme Conversion System and Multilingual Named Entity Dataset with International Phonetic Alphabet

Yuhi Matogawa, Yusuke Sakai, Taro Watanabe, Chihiro Taguchi


Abstract
In Japanese, loanwords are primarily written in Katakana, a syllabic writing system, based on their pronunciation. However, transliterated loanwords often exhibit spelling variations: the word “Hepburn”, for example, is written as “ヘボン (hebon)”, “ヘプバーン (hepubaan)”, or “ヘップバーン (heppubaan)”. These orthographic variants pose a bottleneck in multilingual Named Entity Recognition (NER), because named entities (NEs) do not have one-to-one matches. In this study, we introduce two resources: a rule-based grapheme-to-phoneme (G2P) system for Japanese, grounded in the linguistics literature, and a large-scale multilingual NE dataset annotated with the International Phonetic Alphabet (IPA). We focus on IPA to address the Katakana spelling variations in loanwords. These rules and this dataset are expected to benefit tasks such as NE aggregation, G2P systems, construction of cross-lingual language models, and entity linking. We hope our work advances research on Japanese NER with multilingual loanwords by resolving these spelling ambiguities.
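
To make the idea of rule-based Katakana-to-phoneme conversion concrete, the following is a minimal sketch that maps the three “Hepburn” spellings quoted in the abstract to broad IPA-like strings. The table `KANA_TO_IPA` and the function `katakana_to_ipa` are hypothetical names introduced for illustration only. They cover just the characters in this example and are not the rule set described in the paper.

```python
# Toy broad-transcription table: an illustrative assumption, NOT the paper's rules.
# It covers only the kana that appear in the three "Hepburn" spellings.
KANA_TO_IPA = {"ヘ": "he", "ボ": "bo", "プ": "pɯ", "バ": "ba", "ン": "ɴ"}


def katakana_to_ipa(word: str) -> str:
    """Convert a Katakana string to a broad IPA-like string using toy rules."""
    out = []
    geminate = False  # set when a small "ッ" was seen before the next kana
    for ch in word:
        if ch == "ー":
            # Long-vowel mark: lengthen the preceding vowel.
            if out:
                out.append("ː")
        elif ch == "ッ":
            # Small tsu: double the consonant of the following kana.
            geminate = True
        else:
            ipa = KANA_TO_IPA[ch]  # raises KeyError for kana outside the toy table
            if geminate:
                ipa = ipa[0] + ipa
                geminate = False
            out.append(ipa)
    return "".join(out)


for spelling in ["ヘボン", "ヘプバーン", "ヘップバーン"]:
    print(spelling, katakana_to_ipa(spelling))
```

Under these toy rules the three spellings yield three different phoneme strings, which shows why the variants do not match as plain text while remaining comparable at the phonetic level.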
Anthology ID:
2024.sigmorphon-1.9
Volume:
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Garrett Nicolai, Eleanor Chodroff, Frederic Mailhot, Çağrı Çöltekin
Venue:
SIGMORPHON
SIG:
SIGMORPHON
Publisher:
Association for Computational Linguistics
Pages:
77–86
URL:
https://aclanthology.org/2024.sigmorphon-1.9
DOI:
10.18653/v1/2024.sigmorphon-1.9
Cite (ACL):
Yuhi Matogawa, Yusuke Sakai, Taro Watanabe, and Chihiro Taguchi. 2024. Japanese Rule-based Grapheme-to-phoneme Conversion System and Multilingual Named Entity Dataset with International Phonetic Alphabet. In Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 77–86, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Japanese Rule-based Grapheme-to-phoneme Conversion System and Multilingual Named Entity Dataset with International Phonetic Alphabet (Matogawa et al., SIGMORPHON 2024)
PDF:
https://aclanthology.org/2024.sigmorphon-1.9.pdf