Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices

Petr Pollák, Josef Rajnoha


Abstract
This paper describes Czech spontaneous speech database of lectures on digital signal processing topic collected at Czech Technical University in Prague, commonly with the procedure of its recording and annotation. The database contains 21.7 hours of speech material from 22 speakers recorded in 4 channels with 3 principally different microphones. The annotation of the database is composed from basic time segmentation, orthographic transcription including marks for speaker and environmental non-speech events, pronunciation lexicon in SAMPA alphabet, session and speaker information describing recording conditions, and the documentation. The orthographic transcription with time segmentation is saved in XML format supported by frequently used annotation tool Transcriber. In this article, special attention is also paid to the description of time synchronization of signals recorded by two independent devices: computer based recording platform using two external sound cards and commercial audio recorder Edirol R09. This synchronization is based on cross-correlation analysis with simple automated selection of suitable short signal subparts. The collection and annotation of this database is now complete and its availability via ELRA is currently under preparation.
Anthology ID:
L10-1356
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/516_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Petr Pollák and Josef Rajnoha. 2010. Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Multi-Channel Database of Spontaneous Czech with Synchronization of Channels Recorded by Independent Devices (Pollák & Rajnoha, LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/516_Paper.pdf