Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages

Rowena Cristina L. Guevara; Rhandley D. Cajote; Michael Gringo Angelo R. Bayona; Crisron Rudolf G. Lucas

Abstract

Previous efforts to collect Filipino speech produced the Filipino-Speech Corpus, TAGCO, and the Filipino-Bisaya speech corpus. These corpora, however, are either domain-specific, non-parallel, non-multilingual, or too small for developing state-of-the-art Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) systems, which usually require hundreds of hours of speech data. This paper presents the Philippine Languages Database (PLD), a multilingual corpus for the Philippine languages, namely Filipino, English, Cebuano, Kapampangan, Hiligaynon, Ilokano, Bikolano, Waray, and Tausug. PLD includes over 454 hours of recordings from speakers of the ten languages, covering multiple domains: news, medical, education, tourism, and spontaneous speech. The applicability of the corpus has also been demonstrated in adult and children's ASR, phoneme transcription, voice conversion, and TTS applications.

Anthology ID: 2024.sigul-1.32
Volume: Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Maite Melero, Sakriani Sakti, Claudia Soria
Venues: SIGUL | WS
Publisher: ELRA and ICCL
Pages: 264–271
URL: https://aclanthology.org/2024.sigul-1.32/
Cite (ACL): Rowena Cristina L. Guevara, Rhandley D. Cajote, Michael Gringo Angelo R. Bayona, and Crisron Rudolf G. Lucas. 2024. Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 264–271, Torino, Italia. ELRA and ICCL.
Cite (Informal): Philippine Languages Database: A Multilingual Speech Corpora for Developing Systems for Low-Resource Languages (Guevara et al., SIGUL 2024)
PDF: https://aclanthology.org/2024.sigul-1.32.pdf
