Casablanca: Data and Models for Multidialectal Arabic Speech Recognition

Bashar Talafha, Karima Kadaoui, Samar Magdy, Mariem Habiboullah, Chafei Chafei, Ahmed El-Shangiti, Hiba Zayed, Mohamedou Tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mohamed, Abdellah El Mekki, El Moatez Billah Nagoudi, Benelhadj Saadia, Hamzah Alsayadi, Walid Al-Dhabyani, Sara Shatnawi, Yasir Ech-chammakhy, Amal Makouar, Yousra Berrachedi, Mustafa Jarrar, Shady Shehata, Ismail Berrada, Muhammad Abdul-Mageed


Abstract
Despite recent progress in speech processing, the majority of the world's languages and dialects remain uncovered. This situation deepens an already wide technological divide, thereby hindering technological and socioeconomic inclusion. The challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: www.dlnlp.ai/speech/casablanca.
Anthology ID:
2024.emnlp-main.1211
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
21745–21758
URL:
https://aclanthology.org/2024.emnlp-main.1211
Cite (ACL):
Bashar Talafha, Karima Kadaoui, Samar Magdy, Mariem Habiboullah, Chafei Chafei, Ahmed El-Shangiti, Hiba Zayed, Mohamedou Tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mohamed, Abdellah El Mekki, El Moatez Billah Nagoudi, Benelhadj Saadia, Hamzah Alsayadi, Walid Al-Dhabyani, et al. 2024. Casablanca: Data and Models for Multidialectal Arabic Speech Recognition. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21745–21758, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition (Talafha et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1211.pdf