Hanin Atwany
2026
JEEM: Vision-Language Understanding in Four Arabic Dialects
Karima Kadaoui | Hanin Atwany | Hamdan Al-Ali | Abdelrahman Mohamed | Ali Mekky | Sergei Tilga | Natalia Fedorova | Ekaterina Artemova | Hanan Aldarmaki | Yova Kementchedjhieva
Findings of the Association for Computational Linguistics: EACL 2026
We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4o, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4o ranks best in this comparison, the model’s linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.
2025
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
Hanin Atwany | Abdul Waheed | Rita Singh | Monojit Choudhury | Bhiksha Raj
Findings of the Association for Computational Linguistics: ACL 2025
Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of over 20 ASR models reveals key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increases HER. (3) Distribution shift correlates strongly with HER (𝛼 = 0.91). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
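The abstract distinguishes hallucinations (fabricated content) from ordinary misrecognitions; its exact HER definition is not given here. As a purely hypothetical toy illustration of the idea, and not the paper's metric, one might flag transcripts that share almost no vocabulary with their reference and report the flagged fraction (the `overlap_threshold` value below is an invented parameter):

```python
def _overlap(ref_words, hyp_words):
    # Fraction of hypothesis tokens that also appear in the reference.
    if not hyp_words:
        return 0.0
    ref_set = set(ref_words)
    return sum(w in ref_set for w in hyp_words) / len(hyp_words)

def hallucination_error_rate(pairs, overlap_threshold=0.1):
    """Toy proxy: fraction of (reference, hypothesis) pairs whose non-empty
    hypothesis shares almost no vocabulary with its reference, suggesting
    fabricated rather than merely misrecognized output."""
    flagged = 0
    for ref, hyp in pairs:
        hyp_words = hyp.split()
        if hyp_words and _overlap(ref.split(), hyp_words) < overlap_threshold:
            flagged += 1
    return flagged / len(pairs)
```

Note how this decouples from WER: a hypothesis can have a high WER (many substitutions) yet still draw its words from the reference, while a fluent but unrelated transcript is flagged.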
On the Robust Approximation of ASR Metrics
Abdul Waheed | Hanin Atwany | Rita Singh | Bhiksha Raj
Findings of the Association for Computational Linguistics: ACL 2025
Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited labeled data from diverse domains and testing conditions, the true generalization capabilities of these models beyond standard benchmarks remain unclear. Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. These features are used to train a regression model to predict key ASR metrics like Word Error Rate (WER) and Character Error Rate (CER). We experiment with over 40 models across 14 datasets representing both standard and in-the-wild testing conditions. Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50%.
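For reference, WER and CER (the targets the paper's regression model predicts without ground truth) are standard edit-distance metrics. A minimal self-contained implementation of the reference-based metrics themselves, not the paper's label-free approximation, might look like:

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level edit distance over reference length.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: same computation at character level.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both metrics require a ground-truth reference, which is exactly the dependency the paper's proxy-model approach removes.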
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Fakhraddin Alwajih | Abdellah El Mekki | Samar Mohamed Magdy | AbdelRahim A. Elmadany | Omer Nacar | El Moatez Billah Nagoudi | Reem Abdel-Salam | Hanin Atwany | Youssef Nafea | Abdulfattah Mohammed Yahya | Rahaf Alhamouri | Hamzah A. Alsayadi | Hiba Zayed | Sara Shatnawi | Serry Sibaee | Yasir Ech-chammakhy | Walid Al-Dhabyani | Marwa Mohamed Ali | Imen Jarraya | Ahmed Oumar El-Shangiti | Aisha Alraeesi | Mohammed Anwar AL-Ghrawi | Abdulrahman S. Al-Batati | Elgizouli Mohamed | Noha Taha Elgindi | Muhammed Saeed | Houdaifa Atou | Issam Ait Yahia | Abdelhak Bouayad | Mohammed Machrouh | Amal Makouar | Dania Alkawi | Mukhtar Mohamed | Safaa Taher Abdelfadil | Amine Ziad Ounnoughene | Anfel Rouabhia | Rwaa Assi | Ahmed Sorkatti | Mohamedou Cheikh Tourad | Anis Koubaa | Ismail Berrada | Mustafa Jarrar | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm.
2024
John vs. Ahmed: Debate-Induced Bias in Multilingual LLMs
Anastasiia Demidova | Hanin Atwany | Nour Rabih | Sanad Sha’ban | Muhammad Abdul-Mageed
Proceedings of the Second Arabic Natural Language Processing Conference
Large language models (LLMs) play a crucial role in a wide range of real-world applications. However, concerns about their safety and ethical implications are growing. While research on LLM safety is expanding, there is a noticeable gap in evaluating safety across multiple languages, especially in Arabic and Russian. We address this gap by exploring biases in LLMs across different languages and contexts, focusing on GPT-3.5 and Gemini. Through carefully designed argument-based prompts and scenarios in Arabic, English, and Russian, we examine biases in cultural, political, racial, religious, and gender domains. Our findings reveal biases in these domains. In particular, our investigation uncovers subtle biases where each model tends to present winners as those speaking the primary language the model is prompted with. Our study contributes to ongoing efforts to ensure justice and equality in LLM development and emphasizes the importance of further research towards responsible progress in this field.
OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany | Nour Rabih | Ibrahim Mohammed | Abdul Waheed | Bhiksha Raj
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
We present the results of the shared task “Dialect to MSA Translation”, which tackles challenges posed by the diverse Arabic dialects in machine translation. Covering Gulf, Egyptian, Levantine, Iraqi, and Maghrebi dialects, the task offers 1001 sentences in both MSA and dialects for fine-tuning, alongside 1888 blind test sentences. Leveraging GPT-3.5, a state-of-the-art language model, our method achieved a BLEU score of 29.61. This endeavor holds significant implications for Neural Machine Translation (NMT) systems targeting low-resource languages with linguistic variation. Additionally, negative experiments involving fine-tuning AraT5 and No Language Left Behind (NLLB) using the MADAR dataset resulted in BLEU scores of 10.41 and 11.96, respectively. Future directions include expanding the dataset to incorporate more Arabic dialects and exploring alternative NMT architectures to further enhance translation capabilities.
Arabic Train at NADI 2024 shared task: LLMs’ Ability to Translate Arabic Dialects into Modern Standard Arabic
Anastasiia Demidova | Hanin Atwany | Nour Rabih | Sanad Sha’ban
Proceedings of the Second Arabic Natural Language Processing Conference
Navigating the intricacies of machine translation (MT) involves tackling the nuanced disparities between Arabic dialects and Modern Standard Arabic (MSA), presenting a formidable obstacle. In this study, we delve into Subtask 3 of the NADI shared task (CITATION), focusing on the translation of sentences from four distinct Arabic dialects into MSA. Our investigation explores the efficacy of various models, including Jais, NLLB, GPT-3.5, and GPT-4, in this dialect-to-MSA translation endeavor. Our findings reveal that Jais surpasses all other models, boasting an average BLEU score of 19.48 across combined zero- and few-shot settings, whereas NLLB exhibits the least favorable performance, garnering a BLEU score of 8.77.
Co-authors
- Nour Rabih 3
- Bhiksha Raj 3
- Abdul Waheed 3
- Muhammad Abdul-Mageed 2
- Anastasiia Demidova 2
- Sanad Sha’ban 2
- Rita Singh 2
- Mohammed Anwar AL-Ghrawi 1
- Reem Abdel-Salam 1
- Safaa Taher Abdelfadil 1
- Hamdan Al-Ali 1
- Abdulrahman S. Al-Batati 1
- Walid Al-Dhabyani 1
- Hanan Aldarmaki 1
- Rahaf Alhamouri 1
- Marwa Mohamed Ali 1
- Dania Alkawi 1
- Aisha Alraeesi 1
- Hamzah A. Alsayadi 1
- Fakhraddin Alwajih 1
- Ekaterina Artemova 1
- Rwaa Assi 1
- Houdaifa Atou 1
- Ismail Berrada 1
- Abdelhak Bouayad 1
- Monojit Choudhury 1
- Yasir Ech-chammakhy 1
- Abdellah El Mekki 1
- Ahmed Oumar El-Shangiti 1
- Noha Taha Elgindi 1
- AbdelRahim A. Elmadany 1
- Natalia Fedorova 1
- Mustafa Jarrar 1
- Imen Jarraya 1
- Karima Kadaoui 1
- Yova Kementchedjhieva 1
- Anis Koubaa 1
- Mohammed Machrouh 1
- Samar Mohamed Magdy 1
- Amal Makouar 1
- Ali Mekky 1
- Abdelrahman Mohamed 1
- Elgizouli Mohamed 1
- Mukhtar Mohamed 1
- Ibrahim Mohammed 1
- Omer Nacar 1
- Youssef Nafea 1
- El Moatez Billah Nagoudi 1
- Amine Ziad Ounnoughene 1
- Anfel Rouabhia 1
- Muhammed Saeed 1
- Sara Shatnawi 1
- Shady Shehata 1
- Serry Sibaee 1
- Ahmed Sorkatti 1
- Sergei Tilga 1
- Mohamedou Cheikh Tourad 1
- Issam Ait Yahia 1
- Abdulfattah Mohammed Yahya 1
- Hiba Zayed 1