Parallel Corpus for Japanese Spoken-to-Written Style Conversion

Mana Ihori, Akihiko Takashima, Ryo Masumura


Abstract
As automatic speech recognition (ASR) applications become more widespread, spoken-to-written style conversion, which transforms spoken-style text into written-style text, is becoming an important technology for improving the readability of ASR transcriptions. To establish such conversion technology, a parallel corpus of spoken-style and written-style text is beneficial because it can be used to build end-to-end neural sequence transformation models. Spoken-to-written style conversion involves multiple conversion problems, including punctuation restoration, disfluency detection, and simplification. However, most existing corpora are built for only one of these problems. In addition, Japanese requires handling not only these general conversion problems but also Japanese-specific ones, such as unification of language style (e.g., polite, frank, and direct styles) and restoration of omitted postpositional particles. We therefore created a new Japanese parallel corpus of spoken-style and written-style text that handles the general and Japanese-specific problems simultaneously. To build the corpus, we prepared four types of spoken-style text and used a crowdsourcing service to manually convert them into written-style text. This paper describes how the corpus was built and reports baseline results for spoken-to-written style conversion using the latest neural sequence transformation models.
Anthology ID:
2020.lrec-1.779
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6346–6353
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.779
Cite (ACL):
Mana Ihori, Akihiko Takashima, and Ryo Masumura. 2020. Parallel Corpus for Japanese Spoken-to-Written Style Conversion. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6346–6353, Marseille, France. European Language Resources Association.
Cite (Informal):
Parallel Corpus for Japanese Spoken-to-Written Style Conversion (Ihori et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.779.pdf