2025
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Fakhraddin Alwajih | Abdellah El Mekki | Samar Mohamed Magdy | AbdelRahim A. Elmadany | Omer Nacar | El Moatez Billah Nagoudi | Reem Abdel-Salam | Hanin Atwany | Youssef Nafea | Abdulfattah Mohammed Yahya | Rahaf Alhamouri | Hamzah A. Alsayadi | Hiba Zayed | Sara Shatnawi | Serry Sibaee | Yasir Ech-chammakhy | Walid Al-Dhabyani | Marwa Mohamed Ali | Imen Jarraya | Ahmed Oumar El-Shangiti | Aisha Alraeesi | Mohammed Anwar AL-Ghrawi | Abdulrahman S. Al-Batati | Elgizouli Mohamed | Noha Taha Elgindi | Muhammed Saeed | Houdaifa Atou | Issam Ait Yahia | Abdelhak Bouayad | Mohammed Machrouh | Amal Makouar | Dania Alkawi | Mukhtar Mohamed | Safaa Taher Abdelfadil | Amine Ziad Ounnoughene | Anfel Rouabhia | Rwaa Assi | Ahmed Sorkatti | Mohamedou Cheikh Tourad | Anis Koubaa | Ismail Berrada | Mustafa Jarrar | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm.
ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition
Ali Abouzeid | Bilal Elbouardi | Mohamed Maged | Shady Shehata
Proceedings of The Third Arabic Natural Language Processing Conference
Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters—90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.
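The abstract specifies only the high-level design (Mel-spectrogram input, 2D convolutions, a BiLSTM, and attention, at roughly one million parameters). The sketch below is a minimal, hypothetical PyTorch rendering of that kind of architecture; the layer counts, channel widths, and hidden sizes are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class EmoNet2DCNNBiLSTM(nn.Module):
    """Hypothetical sketch of a 2D-CNN + BiLSTM + attention classifier over
    Mel spectrograms; all sizes are illustrative, not the paper's settings."""
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        feat_dim = 32 * (n_mels // 4)            # channels x reduced Mel bins
        self.bilstm = nn.LSTM(feat_dim, 64, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(128, 1)            # additive attention over time
        self.fc = nn.Linear(128, n_classes)

    def forward(self, mel):                      # mel: (batch, 1, n_mels, time)
        x = self.cnn(mel)                        # (batch, 32, n_mels/4, time/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major features
        h, _ = self.bilstm(x)                    # (batch, time/4, 128)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over frames
        ctx = (w * h).sum(dim=1)                 # attention-weighted pooling
        return self.fc(ctx)                      # emotion logits
```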
BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Nasser Almatham | Kareem Mohamed Darwish | Raghad Al-Rasheed | Waad Thuwaini Alshammari | Muneera Alhoshan | Amal Almazrua | Asma Al Wazrah | Mais Alheraki | Firoj Alam | Preslav Nakov | Norah A. Alzahrani | Eman Albilali | Nizar Habash | Abdelrahman Mustafa El-Sheikh | Muhammad Elmallah | Hamdy Mubarak | Zaid Alyafeai | Mohamed Anwar | Haonan Li | Ahmed Abdelali | Nora Altwairesh | Maram Hasanain | Abdulmohsen Al-Thubaity | Shady Shehata | Bashar Alhafni | Injy Hamed | Go Inoue | Khalid N. Elmadani | Ossama Obeid | Fatima Haouari | Tamer Elsayed | Emad A. Alghamdi | Khalid Almubarak | Saied Alshahrani | Ola Aljareh | Safa Alajlan | Areej Alshaqarawi | Maryam Alshihri | Sultana Alghurabi | Atikah Alzeghayer | Afrah Altamimi | Abdullah Alfaifi | Abdulrahman M Alosaimy
Proceedings of The Third Arabic Natural Language Processing Conference
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
Exploring the Limitations of Detecting Machine-Generated Text
Jad Doughman | Osama Mohammed Afzal | Hawau Olamide Toyin | Shady Shehata | Preslav Nakov | Zeerak Talat
Proceedings of the 31st International Conference on Computational Linguistics
Recent improvements in the quality of text generated by large language models have spurred research into identifying machine-generated text. Such work often presents high-performing detectors. However, humans and machines can produce text in different styles and domains, and the impact of such variation on machine-generated text detection systems remains unclear. In this paper, we audit classification performance for detecting machine-generated text by evaluating on texts with varying writing styles. We find that classifiers are highly sensitive to stylistic changes and differences in text complexity, and in some cases degrade entirely to random classifiers. We further find that detection systems are particularly prone to misclassifying easy-to-read texts while performing well on complex texts, raising concerns about the reliability of detection systems. We recommend that future work attend to stylistic factors and reading difficulty levels of both human-written and machine-generated text.
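As a rough illustration of the kind of audit described here, the sketch below buckets texts by a readability score and reports a detector's accuracy per bucket. The `detector` callable, the Flesch thresholds, and the data format are assumptions for illustration, not the paper's protocol.

```python
from collections import defaultdict
import textstat  # readability scoring; pip install textstat

def accuracy_by_readability(texts, labels, detector):
    """Group (text, label) pairs into readability buckets and report the
    detector's accuracy per bucket. `detector(text) -> 0/1` is assumed."""
    buckets = defaultdict(list)
    for text, label in zip(texts, labels):
        score = textstat.flesch_reading_ease(text)   # higher = easier to read
        bucket = "easy" if score >= 60 else "medium" if score >= 30 else "hard"
        buckets[bucket].append(detector(text) == label)
    return {b: sum(hits) / len(hits) for b, hits in buckets.items()}
```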
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abdellah El Mekki | Houdaifa Atou | Omer Nacar | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialects in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat.
NurseLLM: The First Specialized Language Model for Nursing
Md Tawkat Islam Khondaker | Julia Harrington | Shady Shehata
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple-choice question (MCQ) answering tasks. We develop a multi-stage data generation pipeline to build the first large-scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Fakhraddin Alwajih | Samar M. Magdy | Abdellah El Mekki | Omer Nacar | Youssef Nafea | Safaa Taher Abdelfadil | Abdulfattah Mohammed Yahya | Hamzah Luqman | Nada Almarwani | Samah Aloufi | Baraah Qawasmeh | Houdaifa Atou | Serry Sibaee | Hamzah A. Alsayadi | Walid Al-Dhabyani | Maged S. Al-shaibani | Aya El aatar | Nour Qandos | Rahaf Alhamouri | Samar Ahmad | Mohammed Anwar AL-Ghrawi | Aminetou Yacoub | Ruwa AbuHweidi | Vatimetou Mohamed Lemin | Reem Abdel-Salam | Ahlam Bashiti | Adel Ammar | Aisha Alansari | Ahmed Ashraf | Nora Alturayeif | Alcides Alcoba Inciarte | AbdelRahim A. Elmadany | Mohamedou Cheikh Tourad | Ismail Berrada | Mustafa Jarrar | Shady Shehata | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: EMNLP 2025
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models
Muhammed Saeed | Shaina Raza | Ashmal Vayani | Muhammad Abdul-Mageed | Ali Emami | Shady Shehata
Findings of the Association for Computational Linguistics: EMNLP 2025
Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., “une sentinelle” - grammatically feminine in French but referring to the stereotypically masculine concept “guard”). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.
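A minimal sketch of the kind of aggregation behind the reported percentages: tally the share of male-presenting figures per language from annotated generations. The field names and label values are illustrative assumptions, not the paper's schema.

```python
from collections import defaultdict

def representation_by_language(records):
    """records: iterable of dicts like {"language": "fr", "perceived_gender": "male"}
    (hypothetical fields). Returns the share of male-presenting figures per
    language, the kind of aggregate the study reports."""
    counts = defaultdict(lambda: {"male": 0, "total": 0})
    for r in records:
        counts[r["language"]]["total"] += 1
        if r["perceived_gender"] == "male":
            counts[r["language"]]["male"] += 1
    return {lang: c["male"] / c["total"] for lang, c in counts.items()}
```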
ASR Models for Traditional Emirati Arabic: Challenges, Adaptations, and Performance Evaluation
Maha Alblooki | Kentaro Inui | Shady Shehata
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)
JAWAHER: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking
Samar Mohamed Magdy | Sang Yun Kwon | Fakhraddin Alwajih | Safaa Taher Abdelfadil | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent advancements in instruction fine-tuning, alignment methods such as reinforcement learning from human feedback (RLHF), and optimization techniques like direct preference optimization (DPO), have significantly enhanced the adaptability of large language models (LLMs) to user preferences. However, despite these innovations, many LLMs continue to exhibit biases toward Western, Anglo-centric, or American cultures, with performance on English data consistently surpassing that of other languages. This reveals a persistent cultural gap in LLMs, which complicates their ability to accurately process culturally rich and diverse figurative language, such as proverbs. To address this, we introduce *Jawaher*, a benchmark designed to assess LLMs’ capacity to comprehend and interpret Arabic proverbs. *Jawaher* includes proverbs from various Arabic dialects, along with idiomatic translations and explanations. Through extensive evaluations of both open- and closed-source models, we find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations. These findings highlight the need for ongoing model refinement and dataset expansion to bridge the cultural gap in figurative language processing.
ASR Under Noise: Exploring Robustness for Sundanese and Javanese
Salsabila Zahirah Pranida | Rifo Ahmad Genadi | Muhammad Cendekia Airlangga | Shady Shehata
Proceedings of the 9th Widening NLP Workshop
We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, their effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
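For intuition about the synthetic noise augmentation step mentioned above, the sketch below mixes a noise clip into clean speech at a target signal-to-noise ratio. The simple power-based scaling is a generic recipe for illustration, not necessarily the paper's exact augmentation.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`,
    then add it to the clean waveform (both mono float arrays, same rate)."""
    noise = np.resize(noise, speech.shape)               # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```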
2024
Data Augmentation for Speech-Based Diacritic Restoration
Sara Shatnawi | Sawsan Alqahtani | Shady Shehata | Hanan Aldarmaki
Proceedings of the Second Arabic Natural Language Processing Conference
This paper describes a data augmentation technique for boosting the performance of speech-based diacritic restoration. Our experiments demonstrate the utility of this approach, resulting in improved generalization of all models across different test sets. In addition, we describe the first multi-modal diacritic restoration model, utilizing both speech and text as input modalities. This type of model can be used to diacritize speech transcripts. Unlike previous work that relies on an external ASR model, the proposed model is far more compact and efficient. While the multi-modal framework does not surpass the ASR-based model for this task, it offers a promising approach for improving the efficiency of speech-based diacritization, with a potential for improvement using data augmentation and other methods.
From Nile Sands to Digital Hands: Machine Translation of Coptic Texts
Muhammed Saeed | Asim Mohamed | Mukhtar Mohamed | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the Second Arabic Natural Language Processing Conference
The Coptic language, rooted in the historical landscapes of Egypt, continues to serve as a vital liturgical medium for the Coptic Orthodox and Catholic Churches across Egypt, North Sudan, Libya, and the United States, with approximately ten million speakers worldwide. However, the scarcity of digital resources in Coptic has resulted in its exclusion from digital systems, thereby limiting its accessibility and preservation in modern technological contexts. Our research addresses this issue by developing the most extensive parallel Coptic-centered corpus to date. This corpus comprises over 8,000 parallel sentences between Arabic and Coptic, and more than 24,000 parallel sentences between English and Coptic. We have also developed the first neural machine translation system between Coptic, English, and Arabic. Lastly, we evaluate the capability of leading proprietary Large Language Models (LLMs) to translate to and from Coptic using a few-shot learning approach (in-context learning). Our code and data are available at https://github.com/UBC-NLP/copticmt.
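The few-shot (in-context learning) setup mentioned above amounts to packing parallel examples into a prompt. A minimal sketch follows; the prompt wording is illustrative, not the paper's actual template.

```python
def build_few_shot_prompt(examples, source_sentence, src="Coptic", tgt="English"):
    """Assemble a simple in-context translation prompt from parallel examples
    (list of (source, target) string pairs) plus the sentence to translate."""
    lines = [f"Translate the following sentences from {src} to {tgt}."]
    for src_text, tgt_text in examples:
        lines.append(f"{src}: {src_text}\n{tgt}: {tgt_text}")
    lines.append(f"{src}: {source_sentence}\n{tgt}:")
    return "\n\n".join(lines)
```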
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
Bashar Talafha | Karima Kadaoui | Samar Mohamed Magdy | Mariem Habiboullah | Chafei Mohamed Chafei | Ahmed Oumar El-Shangiti | Hiba Zayed | Mohamedou Cheikh Tourad | Rahaf Alhamouri | Rwaa Assi | Aisha Alraeesi | Hour Mohamed | Fakhraddin Alwajih | Abdelrahman Mohamed | Abdellah El Mekki | El Moatez Billah Nagoudi | Benelhadj Djelloul Mama Saadia | Hamzah A. Alsayadi | Walid Al-Dhabyani | Sara Shatnawi | Yasir Ech-chammakhy | Amal Makouar | Yousra Berrachedi | Mustafa Jarrar | Shady Shehata | Ismail Berrada | Muhammad Abdul-Mageed
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for Casablanca is accessible at: www.dlnlp.ai/speech/casablanca.
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Fajri Koto | Haonan Li | Sara Shatnawi | Jad Doughman | Abdelrahman Sadallah | Aisha Alraeesi | Khalid Almubarak | Zaid Alyafeai | Neha Sengupta | Shady Shehata | Nizar Habash | Preslav Nakov | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL 2024
The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLama2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.
2023
Enhancing Video-based Learning Using Knowledge Tracing: Personalizing Students’ Learning Experience with ORBITS
Shady Shehata | David Santandreu Calonge | Philip Purnell | Mark Thompson
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)
As the world regains its footing following the COVID-19 pandemic, academia is striving to consolidate the gains made in students’ education experience. New technologies such as video-based learning have shown some early improvement in student learning and engagement. In this paper, we present ORBITS predictive engine at YOURIKA company, a video-based student support platform powered by knowledge tracing. In an exploratory case study of one master’s level Speech Processing course at the Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) in Abu Dhabi, half the students used the system while the other half did not. Student qualitative feedback was universally positive and compared the system favorably against current available methods. These findings support the use of artificial intelligence techniques to improve the student learning experience.
Detecting Propaganda Techniques in Code-Switched Social Media Text
Muhammad Umar Salman | Asif Hanif | Shady Shehata | Preslav Nakov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Propaganda is a form of communication intended to influence the opinions and the mindset of the public to promote a particular agenda. With the rise of social media, propaganda has spread rapidly, leading to the need for automatic propaganda detection systems. Most work on propaganda detection has focused on high-resource languages, such as English, and little effort has been made to detect propaganda for low-resource languages. Yet, it is common to find a mix of multiple languages in social media communication, a phenomenon known as code-switching. Code-switching combines different languages within the same text, which poses a challenge for automatic systems. Considering this premise, we propose a novel task of detecting propaganda techniques in code-switched text. To support this task, we create a corpus of 1,030 texts code-switching between English and Roman Urdu, annotated with 20 propaganda techniques at fragment-level. We perform a number of experiments contrasting different experimental setups, and we find that it is important to model the multilinguality directly rather than using translation as well as to use the right fine-tuning strategy. We plan to publicly release our code and dataset.
Can a Prediction’s Rank Offer a More Accurate Quantification of Bias? A Case Study Measuring Sexism in Debiased Language Models
Jad Doughman | Shady Shehata | Leen Al Qadi | Youssef Nafea | Fakhri Karray
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems
Pre-trained language models are known to inherit a plethora of contextual biases from their training data. These biases have proven to be projected onto a variety of downstream applications, making their detection and mitigation imperative. Limited research has been conducted to quantify specific bias types, such as benevolent sexism, which may be subtly present within the inferred connotations of a sentence. To this end, our work aims to: (1) provide a benchmark of sexist sentences; (2) adapt two bias metrics: mean probability score and mean normalized rank; (3) conduct a case study to quantify and analyze sexism in base and de-biased masked language models. We find that debiasing, even in its most effective form (Auto-Debias), solely nullifies the probability score of biasing tokens, while retaining them in high ranks. Auto-Debias illustrates a 90%-96% reduction in mean probability scores from base to debiased models, while only a 3%-16% reduction in mean normalized ranks. Similar to the application of non-parametric statistical tests for data that does not follow a normal distribution, operating on the ranks of predictions rather than their probability scores offers a more representative bias measure.
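To make the two metrics concrete, the sketch below computes a target word's probability and its vocabulary-normalized rank at a masked position using a Hugging Face masked LM. This is a plausible reading of the abstract, not necessarily the paper's exact formulation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def probability_and_normalized_rank(template, target_word):
    """For a template containing [MASK], return the model's probability of
    `target_word` at the mask and its rank normalized by vocabulary size."""
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    target_id = tokenizer.convert_tokens_to_ids(target_word)
    prob = probs[target_id].item()
    rank = (probs > probs[target_id]).sum().item() + 1   # 1 = top prediction
    return prob, rank / probs.numel()                     # normalized rank in (0, 1]

# e.g. probability_and_normalized_rank("The nurse said [MASK] would help.", "she")
```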
AraDiaWER: An Explainable Metric For Dialectical Arabic ASR
Abdulwahab Sahyoun | Shady Shehata
Proceedings of the Second Workshop on NLP Applications to Field Linguistics
Linguistic variability poses a challenge to many modern ASR systems, particularly Dialectical Arabic (DA) ASR systems dealing with low-resource dialects and resulting morphological and orthographic variations in text and speech. Traditional evaluation metrics such as the word error rate (WER) inadequately capture these complexities, leading to an incomplete assessment of DA ASR performance. We propose AraDiaWER, an ASR evaluation metric for Dialectical Arabic (DA) speech recognition systems, focused on the Egyptian dialect. AraDiaWER uses language model embeddings for the syntactic and semantic aspects of ASR errors to identify their root cause, not captured by traditional WER. MiniLM generates the semantic score, capturing contextual differences between reference and predicted transcripts. CAMeLBERT-Mix assigns morphological and lexical tags using a fuzzy matching algorithm to calculate the syntactic score. Our experiments validate the effectiveness of AraDiaWER. By incorporating language model embeddings, AraDiaWER enables a more interpretable evaluation, allowing us to improve DA ASR systems. We position the proposed metric as a complementary tool to WER, capturing syntactic and semantic features not represented by WER. Additionally, we use UMAP analysis to observe the quality of ASR embeddings in the proposed evaluation framework.