Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license

Matěj Korvas; Ondřej Plátek; Ondřej Dušek; Lukáš Žilka; Filip Jurčíček

Abstract

We present a dataset of telephone conversations in English and Czech, developed for training acoustic models for automatic speech recognition (ASR) in spoken dialogue systems (SDSs). The data comprise 45 hours of speech in English and over 18 hours in Czech. A large part of the data, both audio and transcriptions, was collected using crowdsourcing; the remaining transcriptions were produced by hired transcribers. We release the data together with scripts for data pre-processing and for building acoustic models using the HTK and Kaldi ASR toolkits. We also publish the trained models described in this paper. The data are released under the CC-BY-SA 3.0 license; the scripts are licensed under Apache 2.0. In the paper, we report on the methodology of collecting the data, on the size and properties of the data, and on the scripts and their use. We verify the usability of the datasets by training and evaluating acoustic models using the presented data and scripts.

Anthology ID: L14-1443
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 4423–4428
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/535_Paper.pdf
Cite (ACL): Matěj Korvas, Ondřej Plátek, Ondřej Dušek, Lukáš Žilka, and Filip Jurčíček. 2014. Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4423–4428, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Free English and Czech telephone speech corpus shared under the CC-BY-SA 3.0 license (Korvas et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/535_Paper.pdf
