Design and development of an RDB version of the Corpus of Spontaneous Japanese

Hanae Koiso, Yasuharu Den, Ken’ya Nishikawa, Kikuo Maekawa


Abstract
In this paper, we describe the design and development of a new version of the Corpus of Spontaneous Japanese (CSJ), a large-scale spoken corpus released in 2004. CSJ contains various annotations that are represented in XML format (CSJ-XML). CSJ-XML, however, is very complicated and suffers from some problems. To overcome these problems, we developed a relational database version of CSJ (CSJ-RDB), which was released in 2013. CSJ-RDB is based on an extension of the segment and link-based annotation scheme, which we adapted to handle multi-channel and multi-modal streams. Because this scheme adopts a stand-off framework, CSJ-RDB can represent three hierarchical structures at the same time: inter-pausal-unit-top, clause-top, and intonational-phrase-top. CSJ-RDB consists of five types of tables: segment, unaligned-segment, link, relation, and meta-information tables. The database was automatically constructed from annotation files extracted from CSJ-XML by using general-purpose corpus construction tools. CSJ-RDB enables us to easily and efficiently conduct the complex searches required for corpus-based studies of spoken language.
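The stand-off idea in the abstract can be illustrated with a minimal sketch using Python's built-in sqlite3 module. This is not the actual CSJ-RDB schema: the table layouts, column names (`tier`, `start_time`, `end_time`, `label`, `parent_id`, `child_id`), tier names, and sample rows are all assumptions made for illustration, and only two of the five table types (segment and link) are modeled. The sketch shows how the same word segments can belong to two hierarchies at once, here one topped by an inter-pausal unit and one topped by a clause, because the hierarchy lives in link rows rather than in nested markup.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical segment and link tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Hypothetical layout: one row per time-aligned unit on some tier.
        CREATE TABLE segment (
            id         INTEGER PRIMARY KEY,
            tier       TEXT NOT NULL,
            start_time REAL NOT NULL,
            end_time   REAL NOT NULL,
            label      TEXT
        );
        -- Hypothetical layout: one row per parent-child dependency.
        CREATE TABLE link (
            parent_id INTEGER NOT NULL REFERENCES segment(id),
            child_id  INTEGER NOT NULL REFERENCES segment(id)
        );
    """)
    # Invented sample data: one inter-pausal unit, one clause, three words.
    segments = [
        (1, "ipu", 0.0, 1.2, None),
        (2, "clause", 0.0, 2.0, None),
        (3, "word", 0.0, 0.5, "kyou"),
        (4, "word", 0.5, 1.2, "wa"),
        (5, "word", 1.5, 2.0, "hare"),
    ]
    conn.executemany("INSERT INTO segment VALUES (?, ?, ?, ?, ?)", segments)
    # The first two words hang under both the IPU and the clause;
    # the third word belongs to the clause only.
    links = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
    conn.executemany("INSERT INTO link VALUES (?, ?)", links)
    return conn


def children_of(conn, parent_id):
    """Return the labels of all segments linked under parent_id, in time order."""
    rows = conn.execute(
        """
        SELECT c.label
        FROM link
        JOIN segment AS c ON c.id = link.child_id
        WHERE link.parent_id = ?
        ORDER BY c.start_time
        """,
        (parent_id,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(children_of(conn, 1))  # words under the inter-pausal unit
    print(children_of(conn, 2))  # words under the clause
```

Because the word rows are stored once and referenced from several parents, a query over either hierarchy is a plain join, which is the kind of search that overlapping hierarchies make awkward in a single nested XML tree.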
Anthology ID:
L14-1371
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1471–1476
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/432_Paper.pdf
Cite (ACL):
Hanae Koiso, Yasuharu Den, Ken’ya Nishikawa, and Kikuo Maekawa. 2014. Design and development of an RDB version of the Corpus of Spontaneous Japanese. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1471–1476, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Design and development of an RDB version of the Corpus of Spontaneous Japanese (Koiso et al., LREC 2014)