Computational Benchmarks for Egyptian Arabic Child Directed Speech

Salam Khalifa, Abed Qaddoumi, Nizar Habash, Owen Rambow


Abstract
We present AraBabyTalk-EGY, an enriched release of the Egyptian Arabic CHILDES corpus that opens the genre of child-adult interaction to modern Arabic NLP research. Starting from the original CHILDES recordings and IPA transcriptions of caregiver-child sessions, we (i) map each IPA token to fully diacritized Arabic script, and (ii) add core part-of-speech tags and lemmas aligned with existing dialectal Arabic morphological resources. These layers yield ~26K annotated tokens suitable for both text- and speech-based NLP tasks. We provide benchmarks for morphological disambiguation and Arabic ASR. We also outline lexical and morphosyntactic differences between AraBabyTalk-EGY and general Egyptian Arabic resources, highlighting the value of genre-specific training data for language acquisition studies and Arabic speech technology.
Anthology ID:
2026.eacl-long.102
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
2296–2307
URL:
https://aclanthology.org/2026.eacl-long.102/
Cite (ACL):
Salam Khalifa, Abed Qaddoumi, Nizar Habash, and Owen Rambow. 2026. Computational Benchmarks for Egyptian Arabic Child Directed Speech. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2296–2307, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Computational Benchmarks for Egyptian Arabic Child Directed Speech (Khalifa et al., EACL 2026)
PDF:
https://aclanthology.org/2026.eacl-long.102.pdf
Checklist:
 2026.eacl-long.102.checklist.pdf