Building Large-Scale Japanese Pronunciation-Annotated Corpora for Reading Heteronymous Logograms

Fumikazu Sato, Naoki Yoshinaga, Masaru Kitsuregawa


Abstract
Although screen readers enable visually impaired people to read written text via speech, ambiguity in the pronunciation of heteronyms causes misreadings that seriously impair text understanding. Japanese in particular has many common heteronyms written in logograms (Chinese characters, or kanji) whose pronunciations (and meanings) differ completely. In this study, to improve the accuracy of pronunciation prediction, we construct two large-scale Japanese corpora that annotate kanji characters with their pronunciations, for use in training pronunciation prediction models. The corpora are built from existing language resources: i) book titles compiled by the National Diet Library and ii) the books in a Japanese digital library called Aozora Bunko, together with their Braille translations. We first extract sentence-level alignments between the Aozora Bunko text and its pronunciation converted from the Braille data. We then perform dictionary-based pattern matching with morphological dictionaries to find word-level pronunciation alignments. The resulting Book Title corpus contains 336M characters (16.4M book titles), and the Aozora Bunko corpus contains 52M characters (1.6M sentences). We analyzed pronunciation distributions for 203 common heteronyms and trained a BERT-based pronunciation prediction model for 93 heteronyms, which achieved an average accuracy of 0.939.
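The word-level alignment step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `align`, the toy lexicon, and the rule that kana characters read as themselves are all assumptions made for illustration, and a real system would draw candidate readings from a morphological dictionary and handle Braille-specific spelling conventions. The sketch takes a sentence, its kana reading, and a lexicon of candidate readings per word, then searches for a segmentation whose readings concatenate to the given kana string.

```python
from functools import lru_cache


def align(surface, reading, lexicon):
    """Align a sentence with its kana reading, word by word.

    `lexicon` is a hypothetical mapping from a surface word to its candidate
    readings. Returns a list of (word, reading) pairs, or None when no
    segmentation is consistent with the given reading.
    """
    max_len = max((len(w) for w in lexicon), default=1)

    @lru_cache(maxsize=None)
    def solve(i, j):
        # i indexes the surface text, j indexes the kana reading.
        if i == len(surface):
            return () if j == len(reading) else None
        # Try dictionary words starting at i, longest first.
        for n in range(min(max_len, len(surface) - i), 0, -1):
            word = surface[i:i + n]
            for pron in lexicon.get(word, ()):
                if reading.startswith(pron, j):
                    rest = solve(i + n, j + len(pron))
                    if rest is not None:
                        return ((word, pron),) + rest
        # Assumption: a character that is already kana reads as itself.
        ch = surface[i]
        if j < len(reading) and ch == reading[j]:
            rest = solve(i + 1, j + 1)
            if rest is not None:
                return ((ch, ch),) + rest
        return None

    result = solve(0, 0)
    return list(result) if result is not None else None


# Toy lexicon: the heteronym 今日 has two candidate readings.
LEXICON = {"今日": ("きょう", "こんにち"), "雨": ("あめ",)}
print(align("今日も雨", "きょうもあめ", LEXICON))
```

Because the sentence-level reading constrains the search, the heteronym is labeled with whichever candidate reading actually occurs in the aligned kana string, which is how alignment yields pronunciation annotations without manual labeling.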
Anthology ID:
2022.lrec-1.770
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7113–7121
URL:
https://aclanthology.org/2022.lrec-1.770
Cite (ACL):
Fumikazu Sato, Naoki Yoshinaga, and Masaru Kitsuregawa. 2022. Building Large-Scale Japanese Pronunciation-Annotated Corpora for Reading Heteronymous Logograms. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7113–7121, Marseille, France. European Language Resources Association.
Cite (Informal):
Building Large-Scale Japanese Pronunciation-Annotated Corpora for Reading Heteronymous Logograms (Sato et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.770.pdf