Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Jáchym Kolář, Jan Švec


Abstract
Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora, one in the domain of broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. In addition, it is reported that disfluent portions of speech show differences in the distribution of parts of speech (POS) of their content in comparison with the general POS distribution. The two Czech corpora are not only compared with each other, but also with available numbers relating to English MDE corpora of broadcast news and telephone conversations.
Anthology ID:
L08-1120
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/26_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Jáchym Kolář and Jan Švec. 2008. Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations (Kolář & Švec, LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/26_paper.pdf