OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects

Rayyan S. Al Khadhuri; Firas Al Mahrouqi; Salim Al Mandhari; Amir Azad Al-Kathiri; Omar Said Alshahri; Ghassab Mansoor Alsaqr; Badri Abdulhakim Mudhsh; Tarek Fatnassi

Abstract

Automatic Speech Recognition (ASR) has achieved strong performance in high-resource languages; however, Dialectal Arabic remains significantly under-resourced. This gap is particularly evident in Oman, where Arabic exhibits substantial sociolinguistic variation shaped by the settlement patterns of sedentary (Hadari) and nomadic (Badu) communities, variation that urban-centric or generalized Gulf Arabic datasets often overlook. We introduce OMAN-SPEECH, a sociolinguistically stratified spoken corpus for Omani Arabic comprising approximately 40 hours of spontaneous and semi-spontaneous speech from 32 speakers across 11 Wilayats (provinces). The corpus is balanced to capture regional and lifestyle variation. It is annotated at the sentence level, through a human-in-the-loop annotation pipeline, with Arabic transcription, English translation, and phonetic transcription in the International Phonetic Alphabet (IPA). OMAN-SPEECH provides a foundational resource for evaluating ASR and related speech technologies on Omani and Gulf Arabic varieties and supports more granular modeling of regional dialectal variation.

Anthology ID: 2026.abjadnlp-1.31
Volume: Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
Pages: 229–235
URL: https://aclanthology.org/2026.abjadnlp-1.31/
Cite (ACL): Rayyan S. Al Khadhuri, Firas Al Mahrouqi, Salim Al Mandhari, Amir Azad Al-Kathiri, Omar Said Alshahri, Ghassab Mansoor Alsaqr, Badri Abdulhakim Mudhsh, and Tarek Fatnassi. 2026. OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects. In Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script, pages 229–235, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects (Al Khadhuri et al., AbjadNLP 2026)
PDF: https://aclanthology.org/2026.abjadnlp-1.31.pdf