Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition

One of the major challenges in developing automatic speech recognition (ASR) for low-resource languages is the limited access to labeled data with domain-specific variations. In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. With the proposed methodology, we developed a 20k+ hours labeled Bangla speech dataset covering diverse topics, speaking styles, dialects, noisy environments, and conversational scenarios. We then exploited the developed corpus to design a conformer-based ASR system. We benchmarked the trained ASR with publicly available datasets and compared it with other available models. To investigate the efficacy, we designed and developed a human-annotated domain-agnostic test set composed of news, telephony, and conversational data, among others. Our results demonstrate the efficacy of the model trained on pseudo-labeled data for the designed test set, along with publicly available Bangla datasets. The experimental resources will be publicly available: https://github.com/hishab-nlp/Pseudo-Labeling-for-Domain-Agnostic-Bangla-ASR


Introduction
Modern end-to-end automatic speech recognition (E2E-ASR) systems have made remarkable strides, performing well across various types of data (Li et al., 2020; Gulati et al., 2020; Chowdhury et al., 2021; Prabhavalkar et al., 2023). This success can be attributed to advances in deep learning techniques relying on different training strategies, which are highly dependent on large datasets. However, acquiring and maintaining high-quality human transcriptions is both expensive and time-consuming, which hinders further progress for ASR, especially in low-resource languages like Bangla.
To overcome these challenges, two dominant methods that leverage unlabeled audio are gaining popularity: (i) pre-training via self-supervised learning (SSL) (Baevski et al., 2020, 2022; Chung et al., 2021; Hsu et al., 2021); and (ii) pseudo-labeling (PL) (Kahn et al., 2020; Xu et al., 2020b; Manohar et al., 2021; Zhu et al., 2023; Xu et al., 2020a; Higuchi et al., 2022). In the pre-training approach, the model is initially trained on raw unlabeled data and then fine-tuned using limited labeled data for downstream ASR tasks. In pseudo-labeling, a pre-trained model generates labels for unlabeled data, which are then used alongside real labels for supervised ASR training. This paradigm is widely adopted due to its simplicity and effectiveness. Both SSL and PL have been shown to achieve competitive results with minimal labeled data, making these paradigms, especially PL, suitable for low-resource languages.
Despite being the 6th most widely spoken language globally, Bangla still falls within the low-resource language family, mainly due to the lack of accessible open datasets. To reduce this gap, we introduce a pseudo-labeling approach to develop an extensive, large-scale, and high-quality speech dataset of ≈20,000 hours for developing domain-agnostic Bangla ASR. First, we curated and cleaned the largest collection of Bangla audio-video data from various Bangla TV channels on YouTube (YT), covering varying domains, speaking styles, dialects, and communication channels, among others. We then leverage the alignments from two ASR systems to segment and automatically annotate the audio segments. We enrich the quality of pseudo-labels with our confidence- and duration-based filtering method. We utilize the created dataset to design an end-to-end state-of-the-art Bangla ASR. Finally, we benchmark the ASR with widely used, domain-agnostic test sets and compare it with both publicly and commercially available Bangla ASR systems. To test domain-generalization capability, we also developed manually annotated test sets that include domain-diverse speech segments. Our contributions are as follows:
• We develop and release MegaBNSpeech, the largest Bangla speech (≈20,000 hours) training corpus, along with its metadata;
• We introduce a robust data collection pipeline that systematically extracts audio segments from listed channels, ensuring wide coverage of speech samples;
• We develop and publicly release a domain-agnostic state-of-the-art Bangla ASR model;
• We develop two test sets comprising (a) diversified domain data from YT and (b) real-life telephony conversational data, to test model generalizability across domains;
• We benchmark the proposed domain-agnostic Bangla ASR against publicly available test data and ASR models.
The rest of the paper is organized as follows: Section 2 presents previous work, Section 3 describes the dataset, Section 4 formulates our experiments, and Section 5 discusses the evaluation results. Finally, Section 6 concludes and points to possible directions for future work.

Speech Datasets Development
In the realm of speech corpus development, a variety of methods and techniques have been employed across multiple languages. For example, Wang et al. (2005) focused on Mandarin Chinese, creating a speech corpus from broadcast news and aligning the transcriptions. Similarly, Radeck-Arneth et al. (2015) curated data from diverse sources like audiobooks and web recordings to create a comprehensive speech corpus for German. In terms of automatic speech recognition datasets, Chui and Lai (2008) employed a method that constructs a Mandarin Chinese speech corpus using online videos and automated transcription. In a similar vein, Cho et al. (2021) harnessed web data and automatic alignment techniques to develop a Korean speech corpus geared toward speech recognition research.
Furthermore, current literature has also focused on specialized domains or applications. For instance, in the medical field, Cho et al. (2021) crafted a targeted speech corpus designed for medical dictation tasks, featuring recordings from healthcare professionals. Similarly, in the context of voice assistants, Gale et al. (2019) developed a corpus explicitly aimed at training and evaluating voice-controlled systems.

Speech datasets for Bangla
There have been several recent works on Bangla speech recognition. Sumit et al. (2018) proposed a deep learning-based approach and evaluated the model on clean (Alam et al., 2010) and noisy (Bills et al., 2016) speech datasets. Ahmed et al. (2020) introduced the Bengali Speech Corpus. Table 1 compares commonly used Bangla ASR datasets, including Fleurs (Conneau et al., 2023) with 15.61 hours of human-annotated Wikipedia speech, Common Voice 13 (Ardila et al., 2020) with 65.71 hours of human-annotated open-domain speech, OpenSLR (Kjartansson et al., 2018) with 229 hours of human-annotated open-domain speech, and the Bengali Speech Corpus (Ahmed et al., 2020). The Common Voice 13 dataset includes 20.7k training, 9.23k testing, and 9.23k validation audio files. We also segregated the test files from this dataset for evaluation with our selected models.
The OpenSLR Bangla dataset, identified as OpenSLR-53, is a substantial ASR corpus sponsored by Google. It consists of a total of 232,537 recordings, amounting to 229 hours of audio data. For our evaluation purposes, we downloaded specific portions of this dataset and randomly selected 10,142 files, amounting to 10 hours of audio data.
Our introduced dataset surpasses all other available Bangla ASR datasets in terms of dataset size and annotation strategy, as outlined in Table 1. Compared to other methodologies, our data annotation pipeline is specialized in several crucial aspects. First, we focus on the manual curation of channels, allowing us to select content from reputable sources, thus enhancing both relevance and diversity. Second, our pipeline leverages both hybrid ASR and conformer ASR models, which are potentially fine-tuned for Bangla, resulting in more accurate transcriptions. Finally, we have implemented a duplicate removal system to remove redundant content. These features make our data annotation process an excellent fit for applications that demand high-quality, domain-specific Bangla language resources.

Data Collection
To develop a large-scale dataset focused on diverse domains, we selected YouTube as our data source due to its extensive coverage of Bangla speech. We gathered content from popular news channels such as ATN News, Banglavision News, ZEE 24 Ghanta, News18bangla, Republic Bangla, DD Bangla News, ABP Ananda, NTV News, DBC News, BBC News Bangla, Channel 24, mytvbd news, News24, and Channel I News, among others. Additionally, we included talk shows like RTV Talkshow and ATN Bangla Talk Show. We also incorporated travel vlogs into our dataset.
Crawler: To facilitate the collection of data from YouTube, we developed a web crawler that periodically collects videos using youtube-dl. This crawler operates on a list of YouTube channels that we manually pre-select to ensure domain diversity. The crawler then lists all available videos from each channel and proceeds to download them. The download module within the crawler stores the downloaded videos in a Google Cloud Storage (GCS) bucket. The resulting collection consists of ∼53K hours across 42K videos.
Audio Extraction: We extracted audio from the videos, which were originally in Opus format. To ensure compatibility and standardization, we converted these Opus files to WAV format with a sampling rate of 16 kHz. The conversion process was CPU-intensive but required little memory. In Figure 1, we provide the data collection pipeline.
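The Opus-to-WAV conversion described above can be sketched as follows. This is a minimal illustration assuming ffmpeg is the conversion tool (the paper does not name one), and `build_ffmpeg_cmd` is a hypothetical helper:

```python
def build_ffmpeg_cmd(opus_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command converting an Opus file to mono 16 kHz WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", opus_path,          # input Opus file
        "-ar", str(sample_rate),  # resample to the target rate
        "-ac", "1",               # downmix to a single channel
        wav_path,
    ]

cmd = build_ffmpeg_cmd("clip.opus", "clip.wav")
# e.g. subprocess.run(cmd, check=True) would perform the actual conversion
```

Running one ffmpeg process per file keeps memory usage low, matching the CPU-bound profile noted above.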

Pseudo Labeling
In Figure 2, we present the architecture of our proposed pseudo-labeling approach for the MegaBNSpeech corpus development. The system takes audio files extracted from videos and passes them into two distinct in-house developed ASR systems:
• Hybrid ASR (E1): a Kaldi (Povey et al., 2011) based factorized time delay neural network model;
• Conformer ASR (E2): a conformer-based end-to-end model.
We rely on these ASR systems to generate transcriptions based on their decisions, and we use the term expert to refer to these systems.
As part of our proposed pseudo-labeling approach, we consider them as expert systems. Based on the transcripts they generate, we take their decisions on segments that match, as depicted in Figure 2. To formally define this, we have two expert systems E1 and E2, which generate transcripts T1 and T2, respectively. We use a matching algorithm, Algorithm 1, that employs exact string matching to align the text of segments from the expert ASR systems E1 and E2. The next step involves segmenting the audio based on matching text and removing the segments that do not match. For example, the words highlighted in red in Figure 2 indicate mismatched segments; we therefore remove these segments. The subsequent step is to filter out segments based on predefined criteria. These include: (i) the confidence score of the ASR systems, (ii) the minimum and maximum duration of the segments, (iii) the ratio of segment duration to the number of words, and (iv) the minimum number of words required in a segment. These steps resulted in the final MegaBNSpeech corpus.
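A minimal sketch of the matching-and-filtering idea, using Python's difflib for longest-substring matching over word sequences. The function names and threshold values are hypothetical illustrations, not the paper's actual implementation:

```python
from difflib import SequenceMatcher

def agreed_spans(t1_words, t2_words):
    """Return runs of words on which both expert transcripts agree exactly."""
    sm = SequenceMatcher(a=t1_words, b=t2_words, autojunk=False)
    return [t1_words[m.a:m.a + m.size]
            for m in sm.get_matching_blocks() if m.size > 0]

def keep_segment(words, duration_s,
                 rate_bounds=(0.5, 6.0),   # words per second (hypothetical)
                 dur_bounds=(1.0, 20.0),   # seconds (hypothetical)
                 min_chars=3, min_words=2):
    """Apply the duration / word-rate / length filters to one segment."""
    rate = len(words) / duration_s
    n_chars = sum(len(w) for w in words)
    return (dur_bounds[0] <= duration_s <= dur_bounds[1]
            and rate_bounds[0] <= rate <= rate_bounds[1]
            and n_chars >= min_chars
            and len(words) >= min_words)
```

Segments outside the agreed spans, or failing any filter, are discarded before the pseudo-labels enter the corpus.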
In Algorithm 1, r_{w,min} and r_{w,max} refer to the minimum and maximum word rate; d_{a,min} and d_{a,max} refer to the minimum and maximum segment duration; c_{t,min} refers to the minimum number of characters; w_{t,min} refers to the minimum number of total words; and f(t1, t2) is the longest-substring matching function.

Metadata
To ensure both reproducibility and transferability, we store the metadata in JSON format. This metadata includes the following key elements: (i) audio_filepath, (ii) text, and (iii) duration. The audio_filepath field specifies the path to the audio file, with channel information embedded in the filename. The text field contains the data generated by the pseudo-labeling pipeline, while the duration field indicates the length of the audio in seconds. The audio files have a sampling rate of 16 kHz.
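A metadata record of this shape can be serialized as a JSON line, for example. `manifest_entry` and the sample path are hypothetical:

```python
import json

def manifest_entry(audio_filepath, text, duration):
    """Serialize one metadata record as a JSON line."""
    return json.dumps(
        {
            "audio_filepath": audio_filepath,  # channel info embedded in filename
            "text": text,                      # pseudo-label from the pipeline
            "duration": duration,              # clip length in seconds
        },
        ensure_ascii=False,  # keep Bangla text readable in the manifest
    )

line = manifest_entry("atn_news/clip_0001.wav", "example transcript", 4.2)
```

One record per line keeps the manifest streamable for large corpora.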

Data splits
Training set For training the model, the dataset we selected comprises 17.64k hours of news channel content, 688.82 hours of talk shows, 0.02 hours of vlogs, and 4.08 hours of crime shows. Table 2 provides detailed information about each category and its corresponding duration in hours.

Development set
To investigate the robustness of the pseudo-labeling approach, we randomly selected 10 hours of speech to create a development set.
Test set To evaluate the performance of the models, we used four test sets. Two of these were developed as part of the MegaBNSpeech corpus, while the remaining two (Fleurs and Common Voice) are commonly used test sets that are widely recognized by the speech community.
• MegaBNSpeech-YT Test Set: The test set has been prepared from a recent collection of YouTube videos, resulting in 8 hours of data. This set is manually transcribed for evaluation purposes. The domains of this set include News, Talkshow, Courses, Drama, Science, Waz (Islamic preaching), etc.
• MegaBNSpeech-Tele Test Set: To assess the model's generalization capabilities, we also included 1.9 hours of telephony conversations from our in-house dataset collection, which were subsequently manually transcribed. It involves telephone conversations covering various discussion topics, including online food orders, health services, online ticket bookings, and online banking. The calls were originally recorded at an 8 kHz sampling rate, which we then upsampled to 16 kHz to match the ASR input.
• Fleurs: The Fleurs (Conneau et al., 2023) dataset is derived from the FLoRes-101 dataset, which contains 3001 Wikipedia sentences. The authors translated dev and train sentences from FLoRes-101 into 102 languages and annotated them to develop ASR. Fleurs contains a total of 3,010 train, 920 test, and 402 validation audio files. We separated the Bangla test set, which contains 920 audio files with 3.43 hours of data, and evaluated it with our selected models.
• Common Voice: Common Voice (Ardila et al., 2020) is a massively multilingual ASR dataset. The dataset currently consists of 17,689 validated hours in 108 languages, with more voices and languages continually being added. Common Voice 13 contains a total of 20.7k train, 9.23k test, and 9.23k validation audio files. We separated the test set and evaluated it with our selected models.

Contemporary ASR Models
Google: Google Speech-to-Text is a cloud-based ASR service that provides transcription from input audio for several languages. It provides different domain-specific models for task-specific ASR services. We used the default model and settings and set the language to Bangla.

MegaBNSpeech ASR
We trained the FastConformer model (Rekesh et al., 2023) using the full ∼18k-hour MegaBNSpeech training set. During the training phase, we employed a set of predefined parameters: a learning rate of 0.5, a weight decay of 0.001, a batch size of 32, the AdamW optimizer, and a maximum audio duration of 15 seconds. We provide details of the hyperparameter settings in Table 3.
To optimize the performance of our model, we conducted experiments with various NVIDIA NeMo architectures and assessed their training accuracy. Specifically, we evaluated the Conformer-CTC, Conformer-Transducer, and FastConformer models. Among these, the Conformer-CTC model exhibited the best performance, achieving a training loss of approximately 11.2%.
To accelerate the training process, we deployed a total of 16 A100-40G GPUs to handle the entire dataset. Despite leveraging significant computational resources, the training still took approximately 112 hours to complete.
The model underwent training for 15 epochs, completing approximately 90,911 global steps. The chosen learning rate was relatively low, contributing to stable and incremental updates of the model's parameters. Although the training loss suggests potential for further improvement, it does indicate a narrowing gap between predicted and actual values during the training phase.
As for the WER, the value indicates that our model performed with commendable accuracy. However, the validation loss remains somewhat elevated. These metrics offer valuable insights into the model's performance and serve as a roadmap for future optimization efforts.

Data Post-processing
During the evaluation of the test sets, we apply a set of post-processing steps to both the predicted transcriptions and the human annotations to reduce unexpected symbols, confused words, and misleading alignments.
We find that there are some typing issues introduced during manual labeling. To resolve this, a typing-error minimization function is applied. In addition, we added two common normalization rules: (i) number-to-word conversion and (ii) punctuation removal.

Minimizing confusion due to writing style An extensive analysis of transcriptions indicates that many words have different written forms (as shown in Figure 3) based on different character combinations. In some cases, both words of a confused pair are acceptable, as people annotated them in different ways, especially for country names, along with borrowed or code-mixed words.
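The two normalization rules can be sketched as follows. The digit map here is a tiny English placeholder (the actual pipeline handles Bangla numerals), and `normalize` is a hypothetical helper:

```python
import re

# Tiny illustrative digit map; the real pipeline spells out Bangla numerals.
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three"}

def normalize(text):
    """Strip punctuation and spell out digits before scoring."""
    text = re.sub(r"[^\w\s]", " ", text)                  # (ii) punctuation removal
    words = [NUM_WORDS.get(w, w) for w in text.split()]   # (i) number-to-word
    return " ".join(words)
```

Applying the same normalization to both hypothesis and reference ensures neither side is penalized for formatting differences.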
To minimize these differences, we created a simple Global Mapping File (GLM) that allows different variations of a word to be accepted during evaluation. The GLM file contains entries for different homophones, primarily those with spelling variations. We employed the most frequently occurring confusion patterns for the task, although this approach may not cover all possible variations.
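Applying a GLM before scoring amounts to mapping each spelling variant to one canonical form. The entries below are illustrative English examples, since the actual GLM contains Bangla variants:

```python
# Illustrative GLM entries: each variant maps to one canonical spelling.
GLM = {"colour": "color", "labour": "labor"}

def apply_glm(words, glm=GLM):
    """Canonicalize spelling variants so either form scores as correct."""
    return [glm.get(w, w) for w in words]
```

Running this over both hypothesis and reference word lists makes the variant pairs indistinguishable to the WER computation.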

Evaluation Metrics
To evaluate the performance of the models, we used widely accepted metrics such as Word Error Rate (WER) and Character Error Rate (CER). The reported WER values are presented after applying the GLM and post-processing to the hypotheses and references.
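WER itself is the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A straightforward dynamic-programming sketch (CER is the same computation over characters):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)
```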

Robustness of Pseudo-labelling
We first evaluate the robustness of our annotation process for unlabeled audio by utilizing our proposed pseudo-labeling approach. To investigate the quality of these annotations, we used the development set mentioned earlier. This set was subsequently annotated by a human annotator who had no prior knowledge of the ASR-generated pseudo-labels. We then computed the Word Error Rate (WER) and Character Error Rate (CER) between these pseudo-labels (serving as predictions) and the human annotations (acting as ground truth). We observed WER and CER rates of less than 3% (specifically, 2.89% for WER and 2.27% for CER), thereby increasing our confidence in the reliability of the pseudo-labeled datasets.

Effectiveness of MegaBNSpeech ASR
We initially assess the performance of MegaBNSpeech ASR, which is fully trained on a pseudo-labeled dataset, and compare its ASR performance against other systems such as Google, MMS, and OOD-Speech ASRs. Utilizing our in-domain test set (MegaBNSpeech-YT), we noticed a significant performance gap; MegaBNSpeech ASR outperformed the commercial Google ASR, which itself was notably better than the rest (see Table 4).
One plausible explanation for MegaBNSpeech's high performance could be the nature of its training data, which is predominantly sourced from News and Talkshow segments, followed by Science content. These sources typically feature formal speaking styles and limited linguistic diversity, thereby contributing to improved performance. This hypothesis is further supported by the category-level performance data, especially within the 'News' category, as indicated in Table 5. From our analysis, we found that MegaBNSpeech performs comparably to supervised out-of-domain (OOD) speech ASR systems, even when exposed to data or domains it has not previously encountered. This shows the efficacy of pseudo-labeling as well as the potential of both the MegaBNSpeech datasets and the model. In this study, we trained MegaBNSpeech exclusively with pseudo-labels to demonstrate the impact of this automated labeling technique. In practical applications, supplementing pseudo-labels with a small amount of manually annotated data can further enhance ASR performance while leveraging the model's strong generalization capabilities.

Conclusion and Future Work
This study offers a significant contribution to Bangla speech processing, and to the field of ASR more broadly, particularly for low-resource languages. The primary contribution of this paper lies in demonstrating that a model trained with pseudo-labeling only offers comparable performance to supervised ASR systems. Specifically, the MegaBNSpeech model excels in its ability to generalize across multiple domains and channels, as shown in the results. Additionally, the developed train, development, and two test sets of the MegaBNSpeech corpus of ≈20,000 hours of data will serve as a valuable resource for the research community. The MegaBNSpeech corpus, especially the manually annotated YT and telephony test sets, can be used as a benchmark for future studies, enabling other researchers to build upon our work and potentially discover even more effective methods for designing low-resource ASR.
Algorithm 1: Segment filtering
for each m in M do
  r_w ← word rate of m
  d_a ← segment duration of m
  c_t ← total characters in m
  w_t ← total words in m
  if r_w < r_{w,min} or r_w > r_{w,max} or d_a < d_{a,min} or d_a > d_{a,max} or c_t < c_{t,min} or w_t < w_{t,min} then
    discard m
  end if
end for

Figure 2: Architecture of the proposed pseudo-labeling approach.

Table 1: A comparison of commonly used Bangla ASR datasets.

Table 3: Details of the hyperparameter settings. The model was trained using the NVIDIA NeMo framework and published in the Hugging Face model hub.

In Table 5, we report the WER for each category within the MegaBNSpeech-YT test set. From the table, it is evident that all the ASRs (except MMS) perform exceptionally well in the broadcast domain, specifically in News, with MegaBNSpeech achieving nearly 98% accuracy. In the case of talk shows - a

Table 4: Word error rate (WER) / character error rate (CER) on four test sets using four ASR systems. * indicates that the training portion of the corresponding test set was not present in the ASR model.

Table 5: Word error rate (WER) / character error rate (CER) on different categories present in the MegaBNSpeech-YT test set for four different ASR systems.

To assess how ASR models perform not just in unfamiliar domains but also across different communication channels, we evaluated these four models using telephony conversational data, as shown for MegaBNSpeech-Tele in Table 4. Our results indicate that MegaBNSpeech ASR significantly outperforms all other ASRs, with Google coming in second place. This level of performance is consistent with our earlier observations that MegaBNSpeech ASR excels in conversation-style categories like talk shows and vlogs.

Key Points: Pseudo-labelling-based ASR vs. Fully-supervised ASR
Traditional ASR training relies heavily on extensive labeled datasets, a requirement that becomes both challenging and expensive to meet for languages, dialects, and domains with limited resources. In contrast, pseudo-labeling not only enriches the training data but also diversifies domain-specific variations, as demonstrated in this study.