VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

Arabic is a complex language with many varieties and dialects spoken by ~450 million people all around the world. Due to this linguistic diversity and variation, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. We train a wide range of models such as HuBERT (DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR tasks. Our DID models are trained to identify 17 different dialects in addition to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data. Additionally, for the remaining dialects in ASR, we provide the option to choose various models such as Whisper and MMS in a zero-shot setting. We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide range of audiences concerned with Arabic research. Our system is currently running at https://cdce-206-12-100-168.ngrok.io/.


Introduction
The Arabic language, with its diverse regional dialects, represents a unique linguistic spectrum with varying degrees of overlap between the different varieties at all linguistic levels (e.g., phonetic, syntactic, and semantic). In addition to Modern Standard Arabic (MSA), which is primarily used in education, pan-Arab media, and government, there are many local dialects and varieties that are sometimes categorized at regional (Zaidan and Callison-Burch, 2014; Elfardy and Diab, 2013; Elaraby and Abdul-Mageed, 2018), country (Bouamor et al., 2018; Abdul-Mageed et al., 2020a, 2021, 2022), or even province levels (Abdul-Mageed et al., 2020b). Historically, this wide and rich variation between different Arabic varieties has posed a significant challenge for automatic speech recognition (ASR) (Talafha et al., 2023; Alsayadi et al., 2022; Ali, 2020). The main focus has largely been on the recognition of MSA, with very little-to-no focus on its dialects and varieties (Dhouib et al., 2022; Hussein et al., 2022; Ali et al., 2014). As such, ASR systems have conventionally been built either for MSA or for individual dialects, thereby restricting their versatility and adaptability. However, the multifaceted nature of Arabic demands a robust ASR system that caters to its diverse dialects and varieties. In this work, we fill this research gap by introducing and demoing an ASR system integrated with a dialect identification model, dubbed VoxArabica.
VoxArabica is an end-to-end dialect-aware ASR system with dual functionality: (i) it offers a supervised dialect identification model, followed by (ii) a finetuned Whisper Arabic ASR model covering multiple dialects. The dialect identification model assigns input speech one of 18 labels, covering country-level dialects as well as MSA. This then allows the appropriate ASR model to fire. Contrary to traditional methodologies that treat dialect identification and speech recognition as two completely separate tasks, our proposed pipeline integrates the two components, effectively utilizing dialectal information for improved speech recognition. Such an integration not only improves the ASR output, but also establishes a framework aligned with the linguistic diversity inherent to Arabic. Concretely, our contributions can be summarized as follows:
• We introduce and demo our end-to-end VoxArabica system, which integrates dialect identification with state-of-the-art Arabic ASR.
• Our demo is based on a user-friendly web interface characterized by rich functionality such as audio uploading, audio recording, and user feedback options.
The rest of the paper is organized as follows: In Section 2, we overview related work. Section 3 introduces our methods. Section 4 offers a walkthrough of our demo. We conclude in Section 5.

Literature Review
Arabic ASR. Recent ASR research has focused on end-to-end (E2E) methods, such as in Whisper (Radford et al., 2022) and the Universal Speech Model (Zhang et al., 2023). Such E2E deep learning models have significantly elevated ASR performance by learning directly from the audio waveform, bypassing the need for intermediate feature extraction layers (Wang et al., 2019; Radford et al., 2022). Whisper is particularly noteworthy for its multitask training approach, incorporating ASR, voice activity detection, language identification, and speech translation. It has achieved state-of-the-art performance on multiple benchmark datasets such as Librispeech (Panayotov et al., 2015) and TEDLIUM (Rousseau et al., 2012). However, its resilience to adversarial noise has been questioned (Olivier and Raj, 2022).
For Arabic ASR specifically, the first E2E model was introduced using recurrent neural networks coupled with Connectionist Temporal Classification (CTC) (Ahmed et al., 2019). Subsequent works have built upon this foundation, including the development of transformer-based models that excel in both MSA and dialects (Belinkov et al., 2019; Hussein et al., 2022). One challenge for E2E ASR models is the substantial requirement for labeled data, particularly for languages with fewer resources such as the varieties of Arabic. To address this, self-supervised and semi-supervised learning approaches are gaining traction. These models, such as Wav2vec2.0 and XLS-R, initially learn useful representations from large amounts of unlabeled or weakly labeled data and can later be finetuned for specific tasks (Baevski et al., 2020; Babu et al., 2021). W2v-BERT, another self-supervised model, employs contrastive learning and masked language modeling. It has been adapted for Arabic ASR by finetuning on the FLEURS dataset, which represents dialect-accented standard Arabic spoken by Egyptians (Chung et al., 2021; Conneau et al., 2023). Unlike Whisper, both Wav2vec2.0 and w2v-BERT necessitate a finetuning stage for effective decoding.
Arabic DID. Arabic DID has been the subject of a number of studies in recent years, enabled by the collection of spoken Arabic DID corpora such as ADI5 (Ali et al., 2017) and ADI17 (Shon et al., 2020). Advances in model architecture have mirrored changes in the larger LID research community, moving from i-vector-based approaches (Dehak et al., 2010; Ali et al., 2017) towards deep learning-based approaches: x-vectors (Snyder et al., 2018; Shon et al., 2020), end-to-end classification using deep neural networks (Ali et al., 2019; Cai et al., 2018), and transfer learning (Sullivan et al., 2023).
ASR and DID. Combining ASR and DID in a single pipeline remains fairly novel for Arabic. Recent works in this space have employed only limited corpora (Lounnas et al., 2020), or used ASR
transcripts only to improve DID (Malmasi and Zampieri, 2017). Closest to the system we demonstrate in this work is FarSpeech (Eldesouki et al., 2019), since it combines ASR and DID. However, FarSpeech is confined to coarse-grained DID and only supports MSA for ASR. In addition, compared to FarSpeech, our models are modular in that they allow users to run either or both of ASR and DID, depending on their needs.

ASR Models
We train a wide range of ASR models on a list of benchmark Arabic speech datasets. Our models include two versions of Whisper (Radford et al., 2022), large-v2 and small. We also finetune XLS-R (Babu et al., 2022) for the ASR task. For MSA, we train our models on three versions of the Common Voice dataset (Ardila et al., 2019): 6.1, 9.0, and 11.0. We note that Talafha et al. (2023) show that Whisper large-v2 outperforms its smaller variant as well as XLS-R trained on the same dataset.
For Moroccan, Egyptian, and MSA, we fully finetune models on MGB2, MGB3, and MGB5 (Ali et al., 2016, 2017, 2019). We also train ASR models on FLEURS (Conneau et al., 2023), which contains accented Egyptian speech data. In our demo, we also allow users to utilize both Whisper and MMS (Pratap et al., 2023) in the zero-shot setting. We provide information about our text preprocessing in Appendix A.
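The exact preprocessing recipe is given in Appendix A; as an illustration only, the sketch below shows the kind of normalization commonly applied to Arabic ASR transcripts before training and scoring (stripping diacritics and unifying common character variants). The specific steps are typical conventions in the literature, not necessarily those used by VoxArabica.

```python
import re

# Illustrative Arabic text normalization for ASR transcripts.
# These steps are common conventions, not the paper's Appendix A recipe.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef

def normalize(text: str) -> str:
    text = DIACRITICS.sub("", text)                        # strip diacritics
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                # ta marbuta -> ha (optional step)
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> ya (optional step)
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace
```

Normalizations like these reduce spurious word-error-rate penalties caused by orthographic variation rather than actual recognition errors.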

Walkthrough
Our demo consists of a web interface with versatile functionality. It allows users to interact with the system in multiple ways, depending on their needs.
User audio input. Users can either record their own audio through a microphone or upload a prerecorded file. In both cases, we allow different formats such as .wav, .mp3, or .flac, across various audio sampling rates (e.g., 16 kHz or 48 kHz). Figure 1 demonstrates the different options available to the user upon interacting with VoxArabica.
Model selection. Users can choose to select an Arabic variety for transcription, or have it automatically detected using our 18-way DID system. We demonstrate this in Figure 2. Once the variety is detected, the corresponding ASR model performs transcription, and both DID and transcription results are presented on the interface (as shown in Figure 3). We offer various models: one each for EGY and MOR; two for MSA; and two generic models that can be used for any variety. We list all models in Table 2. In cases where the predicted/selected variety is not covered by our ASR models, we fall back to our generic models (i.e., both Whisper zero-shot and MMS zero-shot).
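The model-selection logic just described — a dedicated finetuned model when the variety is covered, otherwise a fallback to both generic zero-shot models — can be sketched as follows. The model names are hypothetical placeholders, not VoxArabica's actual checkpoint identifiers.

```python
# Varieties with dedicated finetuned ASR models; names are
# illustrative placeholders, not real checkpoints.
FINETUNED_ASR = {
    "MSA": "whisper-msa-ft",
    "EGY": "whisper-egy-ft",
    "MOR": "whisper-mor-ft",
}
GENERIC_ASR = ["whisper-zero-shot", "mms-zero-shot"]  # fallback models

def select_asr_models(variety: str) -> list:
    """Return the ASR model(s) to run for a predicted/selected variety."""
    if variety in FINETUNED_ASR:
        return [FINETUNED_ASR[variety]]
    # Variety not covered: fall back to both generic zero-shot models.
    return list(GENERIC_ASR)
```

For example, `select_asr_models("EGY")` yields only the Egyptian finetuned model, whereas an unlisted variety such as `"YEM"` triggers both generic models.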
User feedback. We also provide an option for users to submit anonymous feedback about the produced output by raising a flag. We use this information to collect high-quality silver labels, discarding examples where a flag is raised for incorrect outputs.
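A minimal sketch of how flagged feedback could be used to filter silver labels; the record schema here is a hypothetical illustration, not VoxArabica's actual storage format.

```python
def collect_silver_labels(records):
    """Keep (audio_id, transcript) pairs as silver labels, discarding
    any example a user flagged as incorrect. The record schema is a
    hypothetical illustration."""
    return {
        r["audio_id"]: r["transcript"]
        for r in records
        if not r.get("flagged", False)  # drop flagged outputs
    }
```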
It is important to note that we do not collect any external user data for any purpose, thus ensuring user privacy.
System output. Our system conveniently outputs both the predicted Arabic variety and the transcription across two panels, as shown in Figure 3. For the predicted variety, we show users the top five predictions along with the model confidence for each of them. We provide outputs produced by our models in VoxArabica when the reference input is in Egyptian dialect in Table 1. We also present additional examples in Appendix B, Table 3.
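Showing the top five varieties with confidences amounts to a softmax over the 18 DID class scores followed by a top-k sort. The sketch below assumes the classifier exposes raw logits; the label set shown is abbreviated for illustration.

```python
import math

def top_k_predictions(logits, labels, k=5):
    """Convert raw DID logits into the top-k (label, confidence) pairs
    shown in the demo's prediction panel. Labels/scores here are
    illustrative, not the system's actual outputs."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```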

Conclusion
We present a demonstration of a combined DID and ASR pipeline to illustrate the potential for these systems to improve the usability of dialectal Arabic speech technologies. We report example outputs produced by our system for multiple dialects, showcasing the effectiveness of integrated DID and ASR pipelines. We believe that our demo will advance research toward building a robust and generalized Arabic ASR system for a wide range of varieties and dialects, and will enable a more holistic assessment of the strengths and weaknesses of these methods. For future work, we intend to add models for more dialects and varieties, particularly those which are low-resource.

Limitations
Audio classification tasks can be susceptible to out-of-domain performance degradation, which may impact real-world performance. Similarly, studies on the interpretability of DID models have shown internal encoding of non-linguistic factors such as gender and channel (Chowdhury et al., 2020), which may impart bias to the models. Ensuring that training corpora contain a diverse balance of speaker gender and recording conditions, as well as full coverage of the different styles of language, is an ongoing challenge. We hope that by creating an online demonstration, these limitations can be further explored.

Ethics Statement
Intended use. We build a robust dialect identification and speech recognition system for multiple Arabic dialects as well as MSA. We showcase the capability of our system in the demo. We believe that our work will guide a new direction of research to develop a robust and generalized speech recognition system for Arabic. Through our demo, we integrate DID with an ASR system that supports multiple dialects.
Potential misuse and bias. Since our data is limited to the few dialects involved in finetuning the DID and ASR systems, we do not expect our models to generalize to varieties and dialects of Arabic that are not supported by our models.

Figure 1: Users have the option to either upload files or directly record their audio. Additionally, the dialect can be automatically detected or manually selected for a specific ASR model.

Figure 3: When a specific dialect is manually selected, its associated ASR model generates the transcription. When recording in an unlisted dialect, select "Other". The dialect identification model will then detect the dialect, and both Whisper and MMS zero-shot models will produce the transcription.

Table 1: Example outputs produced by VoxArabica when the input audio is in Egyptian dialect.