Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties

Soumia Bougrine; Aicha Chorana; Abdallah Lakhdari; Hadda Cherroun

doi:10.18653/v1/W17-1317

Abstract

The success of machine learning for automatic speech processing has raised the need for large-scale datasets. However, collecting such data is often challenging, as it requires a significant investment of time and money. In this paper, we devise a recipe for building large-scale speech corpora by harnessing Web resources, namely YouTube, other social media, online radio and TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus dedicated to Algerian dialectal varieties. The preliminary version of our dataset covers all major Algerian dialects. In addition, we make sure that this material takes into account numerous aspects that foster its richness; in particular, we have targeted various speech topics. Some automatic and manual annotations are provided. They gather useful information related to the speakers and sub-dialect information at the utterance level. Our corpus encompasses the 8 major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours of speech segmented into utterances of at least 6 s.

Anthology ID: W17-1317
Volume: Proceedings of the Third Arabic Natural Language Processing Workshop
Month: April
Year: 2017
Address: Valencia, Spain
Editors: Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue: WANLP
SIGs: SEMITIC | SIGARAB
Publisher: Association for Computational Linguistics
Pages: 138–146
URL: https://aclanthology.org/W17-1317/
DOI: 10.18653/v1/W17-1317
Cite (ACL): Soumia Bougrine, Aicha Chorana, Abdallah Lakhdari, and Hadda Cherroun. 2017. Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 138–146, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal): Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties (Bougrine et al., WANLP 2017)
PDF: https://aclanthology.org/W17-1317.pdf
