The Nijmegen Corpus of Casual Czech

Mirjam Ernestus; Lucie Kočková-Amortová; Petr Pollák

Abstract

This article introduces a new speech corpus, the Nijmegen Corpus of Casual Czech (NCCCz), which contains more than 30 hours of high-quality recordings of casual conversations in Common Czech among ten groups of three male friends and ten groups of three female friends. All speakers were native speakers of Czech, raised in Prague or in the region of Central Bohemia, and were between 19 and 26 years old. Every group of speakers consisted of one confederate, who was instructed to keep the conversations lively, and two speakers naive to the purposes of the recordings. The naive speakers were engaged in conversations for approximately 90 minutes, while the confederate joined them for approximately the last 72 minutes. The corpus was orthographically annotated by experienced transcribers, and this orthographic transcription was aligned with the speech signal. In addition, the conversations were videotaped. This corpus can form the basis for all types of research on casual conversations in Czech, including phonetic research and research on how to improve automatic speech recognition. The corpus will be freely available.

Anthology ID: L14-1162
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 365–370
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/134_Paper.pdf
Cite (ACL): Mirjam Ernestus, Lucie Kočková-Amortová, and Petr Pollák. 2014. The Nijmegen Corpus of Casual Czech. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 365–370, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): The Nijmegen Corpus of Casual Czech (Ernestus et al., LREC 2014)
