Semi-automatic annotation of the UCU accents speech corpus

Rosemary Orr, Marijn Huijbregts, Roeland van Beek, Lisa Teunissen, Kate Backhouse, David van Leeuwen


Abstract
Annotation and labeling of speech tasks in large multitask speech corpora is a necessary part of preparing a corpus for distribution. We address three approaches to annotation and labeling: manual, semi-automatic and automatic procedures for labeling the UCU Accent Project speech data, a multilingual multitask longitudinal speech corpus. Accuracy and minimal time investment are the priorities in assessing the efficacy of each procedure. While manual labeling based on aural and visual input should produce the most accurate results, this approach is error-prone because of its repetitive nature. A semi-automatic event detection system requiring manual rejection of false alarms and location and labeling of misses provided the best results. A fully automatic system could not be applied to entire speech recordings because of the variety of tasks and genres. However, it could be used to annotate separate sentences within a specific task. Acoustic confidence measures can correctly detect sentences that do not match the text with an equal error rate (EER) of 3.3%.
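The EER quoted in the abstract is the operating point at which the miss rate and the false-alarm rate of a detector are equal. The abstract does not say how the confidence scores were computed or how the threshold was swept, so the following is only a minimal sketch of the general metric, not the authors' procedure. The function name, the score lists and the convention that a higher score means "sentence does not match the text" are all assumptions made for illustration.

```python
def equal_error_rate(mismatch_scores, match_scores):
    """Return (eer, threshold) at the point where miss and false-alarm rates are closest.

    mismatch_scores: hypothetical scores for sentences that truly do not match the text.
    match_scores:    hypothetical scores for sentences that do match the text.
    Assumed convention: a sentence is flagged as a mismatch when score >= threshold.
    """
    best = None
    # Every observed score is a candidate threshold.
    for t in sorted(set(mismatch_scores) | set(match_scores)):
        # Miss: a true mismatch that is not flagged.
        miss = sum(s < t for s in mismatch_scores) / len(mismatch_scores)
        # False alarm: a matching sentence that is flagged.
        false_alarm = sum(s >= t for s in match_scores) / len(match_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            # Report the mean of the two rates at the closest crossing.
            best = (gap, (miss + false_alarm) / 2, t)
    return best[1], best[2]


# Made-up scores, purely to exercise the function.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

With a finite set of scores the two rates may never be exactly equal, which is why the sketch returns their mean at the threshold where they are closest; evaluation toolkits often interpolate instead.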
Anthology ID:
L14-1424
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1483–1487
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/511_Paper.pdf
Cite (ACL):
Rosemary Orr, Marijn Huijbregts, Roeland van Beek, Lisa Teunissen, Kate Backhouse, and David van Leeuwen. 2014. Semi-automatic annotation of the UCU accents speech corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1483–1487, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Semi-automatic annotation of the UCU accents speech corpus (Orr et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/511_Paper.pdf