Firas Al Mahrouqi


2026

Automatic Speech Recognition (ASR) has achieved strong performance in high-resource languages; however, Dialectal Arabic remains significantly under-resourced. This gap is particularly evident in Oman, where Arabic exhibits substantial sociolinguistic variation shaped by the settlement patterns of sedentary (Hadari) and nomadic (Badu) communities, a distinction that urban-centric or generalized Gulf Arabic datasets often overlook. We introduce OMAN-SPEECH, a sociolinguistically stratified spoken corpus for Omani Arabic comprising approximately 40 hours of spontaneous and semi-spontaneous speech from 32 speakers across 11 Wilayats (provinces). The corpus is balanced to capture regional and lifestyle variation and is annotated at the sentence level, through a human-in-the-loop annotation pipeline, with Arabic transcription, English translation, and phonetic transcription in the International Phonetic Alphabet (IPA). OMAN-SPEECH provides a foundational resource for evaluating ASR and related speech technologies on Omani and Gulf Arabic varieties and supports more granular modeling of regional dialectal variation.

2025

In this paper, we describe our contribution to the Ahasis shared task: Sentiment Analysis on Arabic Dialects in the Hospitality Domain. Within the presented framework, we explored two learning strategies, one tailored to a Large Language Model (LLM) and one to Transformer-based model variants. Few-shot prompting was used with GPT-4o, while fine-tuning was applied in two ways: first to adapt the base MARBERT model to the Ahasis dataset, and second with SODA-BERT, a MARBERT variant that was pretrained on an Omani sentiment dataset and later evaluated on the shared task data.