Preparation of Bangla Speech Corpus from Publicly Available Audio & Text

Shafayat Ahmed; Nafis Sadeq; Sudipta Saha Shubha; Md. Nahidul Islam; Muhammad Abdullah Adnan; Mohammad Zuberul Islam

Abstract

Automatic speech recognition (ASR) systems require large annotated speech corpora, and annotating a large corpus manually is very difficult. In this paper, we focus on the automatic preparation of a speech corpus for Bangladeshi Bangla. We use publicly available Bangla audiobooks and TV news recordings as audio sources. We designed and implemented an iterative algorithm that takes as input a speech corpus and a large amount of raw audio (without transcription) and outputs a much larger speech corpus with reasonable confidence. We leverage techniques such as speaker diarization and gender detection to prepare the annotated corpus. We have also prepared a synthetic speech corpus to handle the out-of-vocabulary word problem in the Bangla language. Our corpus is suitable for training with Kaldi. Experimental results show that using our corpus in addition to the Google Speech corpus (229 hours) significantly improves the performance of the ASR system.

Anthology ID: 2020.lrec-1.811
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 6586–6592
Language: English
URL: https://aclanthology.org/2020.lrec-1.811/
Cite (ACL): Shafayat Ahmed, Nafis Sadeq, Sudipta Saha Shubha, Md. Nahidul Islam, Muhammad Abdullah Adnan, and Mohammad Zuberul Islam. 2020. Preparation of Bangla Speech Corpus from Publicly Available Audio & Text. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6586–6592, Marseille, France. European Language Resources Association.
Cite (Informal): Preparation of Bangla Speech Corpus from Publicly Available Audio & Text (Ahmed et al., LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.811.pdf
