ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Injy Hamed, Ngoc Thang Vu, Slim Abdennadher


Abstract
In this paper, we present ArzEn, a spontaneous speech corpus of Egyptian Arabic-English code-switching (CS). The corpus was collected through informal interviews with 38 Egyptian bilingual university students and employees, held in a soundproof room. A total of 12 hours were recorded, transcribed, validated, and sentence-segmented. The corpus is designed mainly for use in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. We first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS in the corpus, as well as a summary of the effort exerted in corpus development, in terms of the number of hours required for transcription, validation, segmentation, and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose potential challenges to ASR systems.
Anthology ID:
2020.lrec-1.523
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4237–4246
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.523
Cite (ACL):
Injy Hamed, Ngoc Thang Vu, and Slim Abdennadher. 2020. ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4237–4246, Marseille, France. European Language Resources Association.
Cite (Informal):
ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English (Hamed et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.523.pdf