Salmane Chafik


2025

pdf bib
The AraGenEval Shared Task on Arabic Authorship Style Transfer and AI Generated Text Detection
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Hamza Alami | Abdessamad Benlahbib | Salmane Chafik | Mo El-Haj | Abdelkader El Mahdaouy | Mustafa Jarrar | Salima Lamsiyah | Hamzah Luqman
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present an overview of the AraGenEval shared task, organized as part of the ArabicNLP 2025 conference. This task introduced the first benchmark suite for Arabic authorship analysis, featuring three subtasks: Authorship Style Transfer, Authorship Identification, and AI-Generated Text Detection. We curated high-quality datasets, including over 47,000 paragraphs from 21 authors and a balanced corpus of human- and AI-generated texts. The task attracted significant global participation, with 72 registered teams from 16 countries. The results highlight the effectiveness of transformer-based models, with top systems leveraging prompt engineering for style transfer, model ensembling for authorship identification, and a mix of multilingual and Arabic-specific models for AI text detection. This paper details the task design, datasets, participant systems, and key findings, establishing a foundation for future research in Arabic stylistics and trustworthy NLP.

pdf bib
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
Ahmed Abul Hasanaath | Aisha Alansari | Ahmed Ashraf | Salmane Chafik | Hamzah Luqman | Saad Ezzini
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks—boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299

pdf bib
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects
Maram Alharbi | Salmane Chafik | Saad Ezzini | Ruslan Mitkov | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

pdf bib
AHaSIS: Shared Task on Sentiment Analysis for Arabic Dialects
Maram I. Alharbi | Salmane Chafik | Saad Ezzini | Ruslan Mitkov | Tharindu Ranasinghe | Hansi Hettiarachchi
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.

pdf bib
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text
Salima Lamsiyah | Saad Ezzini | Abdelkader El Mahdaoui | Hamza Alami | Abdessamad Benlahbib | Samir El Amrani | Salmane Chafik | Hicham Hammouchi
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text

pdf bib
M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text
Salima Lamsiyah | Saad Ezzini | Abdelkader El Mahdaouy | Hamza Alami | Abdessamad Benlahbib | Samir El amrany | Salmane Chafik | Hicham Hammouchi
Proceedings of the Shared Task on Multi-Domain Detection of AI-Generated Text

The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.

pdf bib
Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
Salmane Chafik | Saad Ezzini | Ismail Berrada
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

The task of converting natural language questions into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source lan-guages, numerous borrowed words, and unique expressions. This demonstrates that our dataset will be a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.