Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Rodolfo Zevallos; Luis Camacho; Nelsi Melgarejo

Abstract

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies that help preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification, and text-to-speech tools. To collect the corpus sustainably, we employ a crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected to reach up to 20 of the 48 native languages of Peru by the year 2022. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. To verify the quality of the corpus, we present speech recognition experiments using the 220 hours of fully transcribed audio.

Anthology ID: 2022.lrec-1.537
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 5029–5034
URL: https://aclanthology.org/2022.lrec-1.537/
Cite (ACL): Rodolfo Zevallos, Luis Camacho, and Nelsi Melgarejo. 2022. Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5029–5034, Marseille, France. European Language Resources Association.
Cite (Informal): Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition (Zevallos et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.537.pdf
