Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers

Basil Abraham; Danish Goel; Divya Siddarth; Kalika Bali; Manu Chopra; Monojit Choudhury; Pratik Joshi; Preethi Jyoti; Sunayana Sitaram; Vivek Seshadri

Abstract

Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology, whose dialects are often very different from those of low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study in which we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.

Anthology ID: 2020.lrec-1.343
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 2819–2826
Language: English
URL: https://aclanthology.org/2020.lrec-1.343/
Cite (ACL): Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyoti, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2819–2826, Marseille, France. European Language Resources Association.
Cite (Informal): Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers (Abraham et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.343.pdf