Saad Ezzini


2026

We present the findings of the AbjadGenEval shared task, organized as part of the AbjadNLP workshop at EACL 2026, which benchmarks AI-generated text detection for Arabic-script languages. Extending beyond Arabic to include Urdu, the task serves as a binary classification platform distinguishing human-written from AI-generated news articles produced by varied LLMs (e.g., GPT, Gemini). Twenty teams participated, with top systems achieving F1 scores of 0.93 for Arabic and 0.89 for Urdu. The results highlight the dominance of multilingual transformers, specifically XLM-RoBERTa and DeBERTa-v3, and reveal significant challenges in cross-domain generalization, where naive data augmentation often yielded diminishing returns. This shared task establishes a robust baseline for authenticating content in the Abjad ecosystem.
Authorship identification is a core problem in Natural Language Processing and computational linguistics, with applications spanning digital humanities, literary analysis, and forensic linguistics. While substantial progress has been made for English and other high-resource languages, authorship attribution for languages written in the Arabic (Abjad) script remains underexplored. In this paper, we present an overview of AbjadAuthorID, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which focuses on multiclass authorship identification across Arabic-script languages. The shared task covers Modern Standard Arabic, Urdu, and Kurdish, and is formulated as a closed-set multiclass classification problem over literary text spanning multiple authors and historical periods. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and report official results for the Arabic track. The findings highlight both the effectiveness of current approaches in controlled settings and the challenges posed by lower participation and resource availability in some language tracks. AbjadAuthorID establishes a new benchmark for multilingual authorship attribution in morphologically rich, underrepresented languages.
Authorship style transfer aims to rewrite a given text so that it reflects the distinctive style of a target author while preserving the original meaning. Despite growing interest in text style transfer, most existing work has focused on English and other high-resource languages, with limited attention to languages written in the Arabic script. In this paper, we present an overview of AbjadStyleTransfer, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which targets authorship style transfer for Arabic-script languages with a strong focus on literary text. The shared task covers Modern Standard Arabic and Urdu, and is designed to encourage research on controllable text generation in morphologically rich and stylistically diverse languages. Participants are required to generate text that conforms to the writing style of a specified author, given a semantically equivalent formal input. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and provide an initial discussion of the challenges associated with authorship style transfer in Arabic-script languages. AbjadStyleTransfer establishes a new benchmark for literary style transfer beyond Latin-script settings and supports future research on culturally grounded and linguistically informed text generation.
Small language models (SLMs) offer computationally efficient alternatives to large language models, yet their translation quality for low-resource languages (LRLs) remains severely limited. This work presents the first large-scale evaluation of SLMs across 200 languages, revealing systematic underperformance in LRLs and identifying key sources of linguistic disparity. We show that knowledge distillation from strong teacher models using predominantly monolingual LRL data substantially boosts SLM translation quality—often enabling 2B–3B models to match or surpass systems up to 70B parameters. Our study highlights three core findings: (1) a comprehensive benchmark exposing the limitations of SLMs on 200 languages; (2) evidence that LRL-focused distillation improves translation without inducing catastrophic forgetting, with full-parameter fine-tuning and decoder-only teachers outperforming LoRA and encoder–decoder approaches; and (3) consistent cross-lingual gains demonstrating the scalability and robustness of the method. These results establish an effective, low-cost pathway for improving LRL translation and provide practical guidance for deploying SLMs in truly low-resource settings.

2025

We present an overview of the AraGenEval shared task, organized as part of the ArabicNLP 2025 conference. This task introduced the first benchmark suite for Arabic authorship analysis, featuring three subtasks: Authorship Style Transfer, Authorship Identification, and AI-Generated Text Detection. We curated high-quality datasets, including over 47,000 paragraphs from 21 authors and a balanced corpus of human- and AI-generated texts. The task attracted significant global participation, with 72 registered teams from 16 countries. The results highlight the effectiveness of transformer-based models, with top systems leveraging prompt engineering for style transfer, model ensembling for authorship identification, and a mix of multilingual and Arabic-specific models for AI text detection. This paper details the task design, datasets, participant systems, and key findings, establishing a foundation for future research in Arabic stylistics and trustworthy NLP.
Recent advancements in large language models (LLMs) have significantly improved software development automation, including bug localization, code synthesis, program repair, and test generation. However, most prior work on program repair focuses on isolated elements, such as classes or functions, neglecting their interdependencies, which limits repair accuracy. We present SynFix, a RelationGraph-based approach that integrates LLMs with structural search and synchronization techniques for coordinated program repair across codebases. SynFix constructs a RelationGraph to capture relationships among classes, functions, variables, and their interactions (e.g., imports, inheritance, dependencies). Each RelationGraph node includes detailed code descriptions to help LLMs understand root causes and retrieve relevant contexts. By analyzing one-hop nodes in the RelationGraph, SynFix ensures repairs account for dependent updates across components. Patch validation is conducted using regression tests from the SWE-bench benchmark suite. Evaluated on SWE-bench datasets, SynFix resolves 52.33% of issues in SWE-bench-lite (300 GitHub issues), 55.8% in SWE-bench-verified (500 issues), and 29.86% in SWE-bench-full (2,294 issues), outperforming baselines such as Swe-Agent, Agentless and AutoCodeRover. The codebase is available at https://anonymous.4open.science/r/AutoFix-EC86/.
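The one-hop retrieval idea above can be sketched as a small graph structure. This is a minimal illustrative sketch, not the SynFix implementation; all class, node, and relation names here are invented for illustration.

```python
# Sketch of a RelationGraph-style structure: nodes carry natural-language
# code descriptions, edges record relations such as imports, inheritance,
# and calls. All names below are hypothetical.
from collections import defaultdict

class RelationGraph:
    def __init__(self):
        self.descriptions = {}         # node -> description of the code element
        self.edges = defaultdict(set)  # node -> set of (relation, neighbour)

    def add_node(self, name, description):
        self.descriptions[name] = description

    def add_edge(self, src, relation, dst):
        # Store relations in both directions so retrieval from either
        # endpoint finds the dependency.
        self.edges[src].add((relation, dst))
        self.edges[dst].add((relation, src))

    def one_hop_context(self, node):
        """Collect descriptions of all one-hop neighbours, so a repair to
        `node` can be checked against its dependents."""
        return {nbr: self.descriptions[nbr]
                for _, nbr in self.edges[node] if nbr in self.descriptions}

g = RelationGraph()
g.add_node("Parser.parse", "Parses a config file into a dict.")
g.add_node("App.load", "Calls Parser.parse and validates keys.")
g.add_edge("App.load", "calls", "Parser.parse")
ctx = g.one_hop_context("Parser.parse")  # App.load must be re-checked after a fix
```

Passing `ctx` alongside the buggy code is one simple way an LLM prompt could be made aware of components that a patch must stay consistent with.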
Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks—boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299
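The three-example in-context setup described above can be sketched as simple prompt construction. The reviews and labels below are invented for illustration and are not drawn from the paper's datasets or actual prompts.

```python
# Hedged sketch of few-shot prompting for Arabic sentiment classification
# with three in-context examples (illustrative data only).
FEW_SHOT = [
    ("الخدمة ممتازة والغرفة نظيفة", "positive"),   # "Excellent service, clean room"
    ("الفندق سيئ جداً ولا أنصح به", "negative"),   # "Very bad hotel, not recommended"
    ("الموقع عادي والسعر مقبول", "neutral"),       # "Average location, acceptable price"
]

def build_prompt(review: str) -> str:
    lines = ["Classify the sentiment of the Arabic review as positive, negative, or neutral.", ""]
    for text, label in FEW_SHOT:
        lines.append(f"Review: {text}\nSentiment: {label}")
    # The final line is left open for the model to complete.
    lines.append(f"Review: {review}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("الإفطار لذيذ والموظفون ودودون")  # "Breakfast is tasty, staff are friendly"
```

The resulting string would then be sent to the LLM; the paper's finding is that even this small amount of in-context evidence shifts classification scores substantially.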
Arabic machine translation remains a fundamentally challenging task, primarily due to the lack of comprehensive annotated resources. This study evaluates the performance of Meta’s NLLB-200 model in translating Modern Standard Arabic (MSA) into three regional dialects (Saudi, Maghribi, and Egyptian Arabic), using a manually curated dataset of hotel reviews. We applied a multi-criteria human annotation framework to assess translation correctness, dialect accuracy, and sentiment and aspect preservation. Our analysis reveals significant variation in translation quality across dialects. While sentiment and aspect preservation were generally high, dialect accuracy and overall translation fidelity were inconsistent. For Saudi Arabic, over 95% of translations required human correction, highlighting systemic issues. Maghribi outputs demonstrated better dialectal authenticity, while Egyptian translations achieved the highest reliability with the lowest correction rate and fewest multi-criteria failures. These results underscore the limitations of current multilingual models in handling informal Arabic varieties and highlight the importance of dialect-sensitive evaluation.
Despite recent progress in large language models (LLMs), their performance on Arabic dialects remains underexplored, particularly in the context of sentiment analysis. This study presents a comparative evaluation of three LLMs, DeepSeek-R1, Qwen2.5, and LLaMA-3, on sentiment classification across Modern Standard Arabic (MSA), Saudi dialect and Darija. We construct a balanced sentiment dataset by translating and validating MSA hotel reviews into Saudi dialect and Darija. Using parameter-efficient fine-tuning (LoRA) and dialect-specific prompts, we assess each model under matched and mismatched prompting conditions. Evaluation results show that Qwen2.5 achieves the highest macro F1 score of 79% on Darija input using MSA prompts, while DeepSeek performs best when prompted in the input dialect, reaching 71% on Saudi dialect. LLaMA-3 exhibits stable performance across prompt variations, with 75% macro F1 on Darija input under MSA prompting. Dialect-aware prompting consistently improves classification accuracy, particularly for neutral and negative sentiment classes.
The hospitality industry in the Arab world increasingly relies on customer feedback to shape services, driving the need for advanced Arabic sentiment analysis tools. To address this challenge, the Sentiment Analysis on Arabic Dialects in the Hospitality Domain shared task focuses on Sentiment Detection in Arabic Dialects. This task leverages a multi-dialect, manually curated dataset derived from hotel reviews originally written in Modern Standard Arabic (MSA) and translated into Saudi and Moroccan (Darija) dialects. The dataset consists of 538 sentiment-balanced reviews spanning positive, neutral, and negative categories. Translations were validated by native speakers to ensure dialectal accuracy and sentiment preservation. This resource supports the development of dialect-aware NLP systems for real-world applications in customer experience analysis. More than 40 teams have registered for the shared task, with 12 submitting systems during the evaluation phase. The top-performing system achieved an F1 score of 0.81, demonstrating the feasibility and ongoing challenges of sentiment analysis across Arabic dialects.
The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.
The task of converting natural language questions into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source languages, numerous borrowed words, and unique expressions. This makes our dataset a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.
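An NLQ-SQL pair of the kind the dataset contains can be illustrated end to end with Python's built-in sqlite3. The Darija question, the schema, and the rows below are invented for illustration and are not taken from Dialect2SQL itself.

```python
# Sketch of executing a (hypothetical) Darija NLQ-SQL pair against a toy database.
import sqlite3

pair = {
    "question_darija": "شحال من فندق كاين فالرباط؟",  # Darija: "How many hotels are in Rabat?"
    "sql": "SELECT COUNT(*) FROM hotels WHERE city = 'Rabat'",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, city TEXT)")
conn.executemany("INSERT INTO hotels VALUES (?, ?)",
                 [("Atlas", "Rabat"), ("Medina", "Fes"), ("Oudaya", "Rabat")])
count = conn.execute(pair["sql"]).fetchone()[0]  # -> 2
```

Execution accuracy, the standard text-to-SQL metric, compares the result of a predicted query against that of the gold query in exactly this manner.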

2024

The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of 77 common intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chatbots. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in Subtask 1, while only 1 team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only team that submitted to Subtask 2 achieved a BLEU score of 1.667.
Code review, which aims at ensuring the overall quality and reliability of software, is a cornerstone of software development. Unfortunately, while crucial, code review is a labor-intensive process that the research community is looking to automate. Existing automated methods rely on single input-output generative models and thus generally struggle to emulate the collaborative nature of code review. This work introduces CodeAgent, a novel multi-agent Large Language Model (LLM) system for code review automation. CodeAgent incorporates a supervisory agent, QA-Checker, to ensure that all the agents’ contributions address the initial review question. We evaluated CodeAgent on critical code review tasks: (1) detect inconsistencies between code changes and commit messages, (2) identify vulnerability introductions, (3) validate code style adherence, and (4) suggest code revisions. The results demonstrate CodeAgent’s effectiveness, contributing to a new state-of-the-art in code review automation. Our data and code are publicly available (https://github.com/Daniel4SE/codeagent).
The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.

2023

Data anonymisation is often required to comply with regulations when transferring information across departments or entities. However, the risk is that this procedure can distort the data and jeopardise the models built on it. Intuitively, the process of training an NLP model on anonymised data may lower the performance of the resulting model when compared to a model trained on non-anonymised data. In this paper, we investigate the impact of de-identification on the performance of nine downstream NLP tasks. We focus on the anonymisation and pseudonymisation of personal names and compare six different anonymisation strategies for two state-of-the-art pre-trained models. Based on these experiments, we formulate recommendations on how the de-identification should be performed to guarantee accurate NLP models. Our results reveal that de-identification does have a negative impact on the performance of NLP models, but this impact is relatively low. We also find that using pseudonymisation techniques involving random names leads to better performance across most tasks.
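Pseudonymisation with random names, the best-performing family of strategies in these experiments, can be sketched with the standard library. The name list, function, and example text below are invented for illustration; the paper's actual pipelines are more elaborate (e.g., NER-based name detection).

```python
# Minimal sketch of random-name pseudonymisation: each detected personal
# name is mapped consistently to a pseudonym drawn at random.
import random

RANDOM_NAMES = ["Alex Morgan", "Sam Carter", "Jo Keller"]  # hypothetical pool

def pseudonymise(text, detected_names, seed=0):
    """Replace each detected personal name with a consistent random pseudonym,
    so repeated mentions of one person stay linked after de-identification."""
    rng = random.Random(seed)  # seeded for reproducibility
    mapping = {}
    for name in detected_names:
        if name not in mapping:
            mapping[name] = rng.choice(RANDOM_NAMES)
        text = text.replace(name, mapping[name])
    return text

out = pseudonymise("Maria Lopez emailed Maria Lopez's manager.", ["Maria Lopez"])
```

Keeping the name-to-pseudonym mapping consistent within a document is what distinguishes pseudonymisation from plain redaction, and plausibly why downstream models suffer less from it.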