Samar Ahmed

2025

Arabic Large Language Models are usually evaluated using Western-centric benchmarks that overlook essential cultural contexts, making them less effective and culturally misaligned for Arabic-speaking communities. This study addresses this gap by evaluating the Arabic Massive Multitask Language Understanding (MMLU) Benchmark to assess its cultural alignment and relevance for Arabic Large Language Models (LLMs) across culturally sensitive topics. A team of eleven experts annotated over 2,500 questions, evaluating them based on fluency, adequacy, cultural appropriateness, bias detection, religious sensitivity, and adherence to social norms. Through human assessment, the study highlights significant cultural misalignments and biases, particularly in sensitive areas like religion and morality. In response to these findings, we propose annotation guidelines and integrate culturally enriched data sources to enhance the benchmark’s reliability and relevance. The research highlights the importance of cultural sensitivity in evaluating inclusive Arabic LLMs, fostering more widely accepted LLMs for Arabic-speaking communities.

2024

pdf bib abs
ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar | Abdullah Alharbi | Serry Sibaee | Samar Ahmed | Lahouari Ghouti | Anis Koubaa
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

The translation between Modern Standard Arabic (MSA) and the various Arabic dialects presents unique challenges due to the significant linguistic, cultural, and contextual variations across the regions where Arabic is spoken. This paper presents a system description of our participation in the OSACT 2024 Dialect to MSA Translation Shared Task. We explain our comprehensive approach which combines data augmentation techniques using generative pre-trained transformer models (GPT-3.5 and GPT-4) with fine-tuning of AraT5 V2, a model specifically designed for Arabic translation tasks. Our methodology has significantly expanded the training dataset, thus improving the model’s performance across five major Arabic dialects, namely Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. We have rigorously evaluated our approach, using BLEU score, to ensure translation accuracy, fluency, and the preservation of meaning. Our results showcase the effectiveness of our refined models in addressing the challenges posed by diverse Arabic dialects and Modern Standard Arabic (MSA), achieving a BLEU score of 80% on the validation test set and 22.25% on the blind test set. However, it’s important to note that while utilizing a larger dataset, such as Madar + Dev, resulted in significantly higher evaluation BLEU scores, the performance on the blind test set was relatively lower. This observation underscores the importance of dataset size in model training, revealing potential limitations in generalization to unseen data due to variations in data distribution and domain mismatches.

pdf bib abs
ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Taiseer Sibaee | Abdullah I. Alharbi | Samar Ahmed | Omar Nacar | Lahouri Ghouti | Anis Koubaa
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

This research delves into the issue of hallucination detection in Large Language Models (LLMs) using Arabic language datasets. As LLMs are increasingly being used in various applications, the phenomenon of hallucination, which refers to generating factually inaccurate content despite grammatical coherence, poses significant challenges. We participate in the OSACT 2024 Shared-task (Detection of Hallucination in Arabic Factual Claims Generated by ChatGPT and GPT4). We explore various approaches for detecting and mitigating hallucination, using models such as GPT-4, Mistral, and Gemini within a novel experimental framework. Our research findings reveal that the effectiveness of these models in classifying claims into Fact-Claim, Fact-Improvement, and Non-Fact categories varies greatly, underscoring the complexities of addressing hallucination in morphologically rich languages. The study emphasizes the need for advanced modelling and training strategies to enhance the reliability and factual accuracy of LLM-generated content, laying the groundwork for future explorations in mitigating hallucination risks. In our experiments we achieved a 0.54 F1 in GPT-4 LLM.