Ken’ya Nishikawa


2022

pdf bib
Design and Evaluation of the Corpus of Everyday Japanese Conversation
Hanae Koiso | Haruka Amatani | Yasuharu Den | Yuriko Iseki | Yuichi Ishimoto | Wakako Kashino | Yoshiko Kawabata | Ken’ya Nishikawa | Yayoi Tanaka | Yasuyuki Usuda | Yuka Watanabe
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We have constructed the Corpus of Everyday Japanese Conversation (CEJC) and published it in March 2022. The CEJC is designed to contain various kinds of everyday conversations in a balanced manner to capture their diversity. The CEJC features not only audio but also video data to facilitate precise understanding of the mechanism of real-life social behavior. The publication of a large-scale corpus of everyday conversations that includes video data is a new approach. The CEJC contains 200 hours of speech, 577 conversations, about 2.4 million words, and a total of 1675 conversants. In this paper, we present an overview of the corpus, including the recording method and devices, structure of the corpus, formats of video and audio files, transcription, and annotations. We then report some results of the evaluation of the CEJC in terms of conversant and conversation attributes. We show that the CEJC includes a good balance of adult conversants in terms of gender and age, as well as a variety of conversations in terms of conversation forms, places, activities, and numbers of conversants.

2018

pdf bib
Construction of the Corpus of Everyday Japanese Conversation: An Interim Report
Hanae Koiso | Yasuharu Den | Yuriko Iseki | Wakako Kashino | Yoshiko Kawabata | Ken’ya Nishikawa | Yayoi Tanaka | Yasuyuki Usuda
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2014

pdf bib
Design and development of an RDB version of the Corpus of Spontaneous Japanese
Hanae Koiso | Yasuharu Den | Ken’ya Nishikawa | Kikuo Maekawa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), which is a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from some problems. To overcome this problem, we have developed and released, in 2013, a relational database version of CSJ (CSJ-RDB). CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five different types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct complex searches required for corpus-based studies of spoken language.