Serry Sibaee

2025

As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm.

pdf bib

ANLPers at AraGenEval Shared Task: Descriptive Author Tokens for Transparent Arabic Authorship Style Transfer
Omer Nacar | Mahmoud Reda | Serry Sibaee | Yasser Alhabashi | Adel Ammar | Wadii Boulila
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

pdf bib

ANLPers at BAREC Shared Task 2025: Readability of Embeddings Training Neural Readability Classifiers on the BAREC Corpus
Serry Sibaee | Omer Nacar | Yasser Alhabashi | Adel Ammar | Wadii Boulila
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

pdf bib

ANPLers at IqraEval Shared task: Adapting Whisper-large-v3 as Speech-to-Phoneme for Qur’anic Recitation Mispronunciation Detection
Nour Qandos | Serry Sibaee | Samar Ahmad | Omer Nacar | Adel Ammar | Wadii Boulila | Yasser Alhabashi
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

pdf bib

ANLPers at MAHED2025: From Hate to Hope: Boosting Arabic Text Classification
Yasser Alhabashi | Serry Sibaee | Omer Nacar | Adel Ammar | Wadii Boulila
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

pdf bib

pdf bib abs

Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.

2024

pdf bib abs

ASOS at ArAIEval Shared Task: Integrating Text and Image Embeddings for Multimodal Propaganda Detection in Arabic Memes
Yasser Alhabashi | Abdullah Alharbi | Samar Ahmad | Serry Sibaee | Omer Nacar | Lahouari Ghouti | Anis Koubaa
Proceedings of the Second Arabic Natural Language Processing Conference

This paper describes our participation in the ArAIEval Shared Task 2024, focusing on Task 2C, which challenges participants to detect propagandistic elements in multimodal Arabic memes. The challenge involves analyzing both the textual and visual components of memes to identify underlying propagandistic messages. Our approach integrates the capabilities of MARBERT and ResNet50, top-performing pre-trained models for text and image processing, respectively. Our system architecture combines these models through a fusion layer that integrates and processes the extracted features, creating a comprehensive representation that is more effective in detecting nuanced propaganda. Our proposed system achieved significant success, placing second with an F1 score of 0.7987.

pdf bib abs

Semantic search tasks have grown extremely fast following the advancements in large language models, including the Reverse Dictionary and Word Sense Disambiguation in Arabic. This paper describes our participation in the Contemporary Arabic Dictionary Shared Task. We propose two models that achieved first place in both tasks. We conducted comprehensive experiments on the latest five multilingual sentence transformers and the Arabic BERT model for semantic embedding extraction. We achieved a ranking score of 0.06 for the reverse dictionary task, which is double than last year’s winner. We had an accuracy score of 0.268 for the Word Sense Disambiguation task.

pdf bib abs

ASOS at NADI 2024 shared task: Bridging Dialectness Estimation and MSA Machine Translation for Arabic Language Enhancement
Omer Nacar | Serry Sibaee | Abdullah Alharbi | Lahouari Ghouti | Anis Koubaa
Proceedings of the Second Arabic Natural Language Processing Conference

This study undertakes a comprehensive investigation of transformer-based models to advance Arabic language processing, focusing on two pivotal aspects: the estimation of Arabic Level of Dialectness and dialectal sentence-level machine translation into Modern Standard Arabic. We conducted various evaluations of different sentence transformers across a proposed regression model, showing that the MARBERT transformer-based proposed regression model achieved the best root mean square error of 0.1403 for Arabic Level of Dialectness estimation. In parallel, we developed bi-directional translation models between Modern Standard Arabic and four specific Arabic dialects—Egyptian, Emirati, Jordanian, and Palestinian—by fine-tuning and evaluating different sequence-to-sequence transformers. This approach significantly improved translation quality, achieving a BLEU score of 0.1713. We also enhanced our evaluation capabilities by integrating MSA predictions from the machine translation model into our Arabic Level of Dialectness estimation framework, forming a comprehensive pipeline that not only demonstrates the effectiveness of our methodologies but also establishes a new benchmark in the deployment of advanced Arabic NLP technologies.

pdf bib abs

ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar | Abdullah Alharbi | Serry Sibaee | Samar Ahmed | Lahouari Ghouti | Anis Koubaa
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

The translation between Modern Standard Arabic (MSA) and the various Arabic dialects presents unique challenges due to the significant linguistic, cultural, and contextual variations across the regions where Arabic is spoken. This paper presents a system description of our participation in the OSACT 2024 Dialect to MSA Translation Shared Task. We explain our comprehensive approach which combines data augmentation techniques using generative pre-trained transformer models (GPT-3.5 and GPT-4) with fine-tuning of AraT5 V2, a model specifically designed for Arabic translation tasks. Our methodology has significantly expanded the training dataset, thus improving the model’s performance across five major Arabic dialects, namely Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. We have rigorously evaluated our approach, using BLEU score, to ensure translation accuracy, fluency, and the preservation of meaning. Our results showcase the effectiveness of our refined models in addressing the challenges posed by diverse Arabic dialects and Modern Standard Arabic (MSA), achieving a BLEU score of 80% on the validation test set and 22.25% on the blind test set. However, it’s important to note that while utilizing a larger dataset, such as Madar + Dev, resulted in significantly higher evaluation BLEU scores, the performance on the blind test set was relatively lower. This observation underscores the importance of dataset size in model training, revealing potential limitations in generalization to unseen data due to variations in data distribution and domain mismatches.

2023

pdf bib abs

A reverse dictionary takes a descriptive phrase of a particular concept and returns words with definitions that align with that phrase. While many reverse dictionaries cater to languages such as English and are readily available online or have been developed by researchers, there is a notable lack of similar resources for the Arabic language. This paper describes our participation in the Arabic Reverse Dictionary shared task. Our proposed method consists of two main steps: First, we convert word definitions into multidimensional vectors. Then, we train these encoded vectors using the Semi-Decoder model for our target task. Our system secured 2nd place based on the Rank metric for both embeddings (Electra and Sgns).

pdf bib abs

The rapid proliferation of disinformation through social media has become one of the most dangerous means to deceive and influence people’s thoughts, viewpoints, or behaviors due to social media’s facilities, such as rapid access, lower cost, and ease of use. Disinformation can spread through social media in different ways, such as fake news stories, doctored images or videos, deceptive data, and even conspiracy theories, thus making detecting disinformation challenging. This paper is a part of participation in the ArAIEval competition that relates to disinformation detection. This work evaluated four models: MARBERT, the proposed ensemble model, and two tests over GPT-4 (zero-shot and Few-shot). GPT-4 achieved micro-F1 79.01% while the ensemble method obtained 76.83%. Despite no improvement in the micro-F1 score on the dev dataset using the ensemble approach, we still used it for the test dataset predictions. We believed that merging different classifiers might enhance the system’s prediction accuracy.