ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu


Abstract
We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.
Anthology ID:
2022.wanlp-1.12
Volume:
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
119–130
URL:
https://aclanthology.org/2022.wanlp-1.12
DOI:
10.18653/v1/2022.wanlp-1.12
Cite (ACL):
Injy Hamed, Nizar Habash, Slim Abdennadher, and Ngoc Thang Vu. 2022. ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 119–130, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English (Hamed et al., WANLP 2022)
PDF:
https://aclanthology.org/2022.wanlp-1.12.pdf
Video:
https://aclanthology.org/2022.wanlp-1.12.mp4