Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora

Andrew Caines; Christian Bentz; Calbert Graham; Tim Polzehl; Paula Buttery

Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora

Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl, Paula Buttery

Abstract

We announce the release of the CROWDED CORPUS: a pair of speech corpora collected via crowdsourcing, containing a native speaker corpus of English (CROWDED_ENGLISH), and a corpus of German/English bilinguals (CROWDED_BILINGUAL). Release 1 of the CROWDED CORPUS contains 1000 recordings amounting to 33,400 tokens collected from 80 speakers and is freely available to other researchers. We recruited participants via the Crowdee application for Android. Recruits were prompted to respond to business-topic questions of the type found in language learning oral tests. We then used the CrowdFlower web application to pass these recordings to crowdworkers for transcription and annotation of errors and sentence boundaries. Finally, the sentences were tagged and parsed using standard natural language processing tools. We propose that crowdsourcing is a valid and economical method for corpus collection, and discuss the advantages and disadvantages of this approach.

Anthology ID:: L16-1340
Volume:: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:: May
Year:: 2016
Address:: Portorož, Slovenia
Editors:: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:: 2145–2152
Language:
URL:: https://aclanthology.org/L16-1340/
DOI:
Bibkey:
Cite (ACL):: Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl, and Paula Buttery. 2016. Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2145–2152, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):: Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora (Caines et al., LREC 2016)
Copy Citation:
PDF:: https://aclanthology.org/L16-1340.pdf

PDF Cite Search Fix data