Workshop on Arabic Corpus Linguistics (2025)


up

pdf (full)
bib (full)
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

pdf bib
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Saad Ezzini | Hamza Alami | Ismail Berrada | Abdessamad Benlahbib | Abdelkader El Mahdaouy | Salima Lamsiyah | Hatim Derrouz | Amal Haddad Haddad | Mustafa Jarrar | Mo El-Haj | Ruslan Mitkov | Paul Rayson

pdf bib
ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models
Salima Lamsiyah | Kamyar Zeinalipour | Samir El amrany | Matthias Brust | Marco Maggini | Pascal Bouvry | Christoph Schommer

Recent efforts in natural language processing (NLP) commonsense reasoning research have led to the development of numerous new datasets and benchmarks. However, these resources have predominantly been limited to English, leaving a gap in evaluating commonsense reasoning in other languages. In this paper, we introduce the ArabicSense Benchmark, which is designed to thoroughly evaluate the world-knowledge commonsense reasoning abilities of large language models (LLMs) in Arabic. This benchmark includes three main tasks: first, it tests whether a system can distinguish between natural language statements that make sense and those that do not; second, it requires a system to identify the most crucial reason why a nonsensical statement fails to make sense; and third, it involves generating explanations for why statements do not make sense. We evaluate several Arabic BERT-based models and causal LLMs on these tasks. Experimental results demonstrate improvements after fine-tuning on our dataset. For instance, AraBERT v2 achieved an 87% F1 score on the second task, while Gemma and Mistral-7b achieved F1 scores of 95.5% and 94.8%, respectively. For the generation task, LLaMA-3 achieved the best performance with a BERTScore F1 of 77.3%, closely followed by Mistral-7b at 77.1%. All codes and the benchmark will be made publicly available at https://github.com/.

pdf bib
Lahjawi: Arabic Cross-Dialect Translator
Mohamed Motasim Hamed | Muhammad Hreden | Khalil Hennara | Zeina Aldallal | Sara Chrouf | Safwan AlModhayan

In this paper, we explore the rich diversity of Arabic dialects by introducing a suite of pioneering models called Lahjawi. The primary model, Lahjawi-D2D, is the first designed for cross-dialect translation among 15 Arabic dialects. Furthermore, we introduce Lahjawi-D2MSA, a model designed to convert any Arabic dialect into Modern Standard Arabic (MSA). Both models are fine-tuned versions of Kuwain-1.5B an in-house built small language model, tailored for Arabic linguistic characteristics. We provide a detailed overview of Lahjawi’s architecture and training methods, along with a comprehensive evaluation of its performance. The results demonstrate Lahjawi’s success in preserving meaning and style, with BLEU scores of 9.62 for dialect-to-MSA and 9.88 for dialect-to- dialect tasks. Additionally, human evaluation reveals an accuracy score of 58% and a fluency score of 78%, underscoring Lahjawi’s robust handling of diverse dialectal nuances. This research sets a foundation for future advancements in Arabic NLP and cross-dialect communication technologies.

pdf bib
Lost in Variation: An Unsupervised Methodology for Mining Lexico-syntactic Patterns in Middle Arabic Texts
Julien JB Bezançon | Rimane Karam | Gaël Lejeune

While MSA and some dialects of Arabic have been extensively studied in NLP, Middle Arabic is still very much unknown to the field. However, Middle Arabic holds issues that are still not covered: it is characterized by variation since it mixes standard features, colloquial ones, as well as features that belong to neither of the two. Here, we introduce a methodology to identify, extract and rank variations of 13 manually retrieved formulas. Those formulas come from the nine first booklets of S ̄IRAT AL-MALIK AL-Z. ̄AHIR BAYBAR S., a corpus of Damascene popular literature written in Middle Arabic and composed of 53,843 sentences. In total, we ranked 20, sequences according to their similarity with the original formulas on multiple linguistic layers. We noticed that the variations in these formulas occur in a lexical, morphological and graphical level, but in opposition, the semantic and syntactic levels remain strictly invariable.

pdf bib
SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics
Salwa Saad Alahmari

This paper presents the Saudi Arabian Dialects Song Lyrics Corpus (SADSLyC), the first dataset featuring song lyrics from the five major Saudi dialects: Najdi (Central Region), Hijazi (Western Region), Shamali (Northern Region), Janoubi (Southern Region), and Shargawi (Eastern Region). The dataset consists of 31,358 sentences, with each sentence representing a self-contained verse in a song, totaling 151,841 words. Additionally, we present a baseline experiment using the SaudiBERT model to classify the fine-grained dialects in the SADSLyC Corpus. The model achieved an overall accuracy of 73% on the test dataset.

pdf bib
Enhancing Dialectal Arabic Intent Detection through Cross-Dialect Multilingual Input Augmentation
Shehenaz Hossain | Fouad Shammary | Bahaulddin Shammary | Haithem Afli

Addressing the challenges of Arabic intent detection amid extensive dialectal variation, this study presents a crossdialtectal, multilingual approach for classifying intents in banking and migration contexts. By augmenting dialectal inputs with Modern Standard Arabic (MSA) and English translations, our method leverages cross-lingual context to improve classification accuracy. We evaluate single-input (dialect-only), dual-input (dialect + MSA), and triple-input (dialect + MSA + English) models, applying language-specific tokenization for each. Results demonstrate that, in the migration dataset, our model achieved an accuracy gain of over 50% on Tunisian dialect, increasing from 43.3% with dialect-only input to 94% with the full multilingual setup. Similarly, in the PAL (Palestinian dialect) dataset, accuracy improved from 87.7% to 93.5% with translation augmentation, reflecting a gain of 5.8 percentage points. These findings underscore the effectiveness of our approach for intent detection across various Arabic dialects.

pdf bib
Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic
Abdullah Salem Khered | Youcef Benkhedda | Riza Batista-Navarro

Social media has become an essential focus for Natural Language Processing (NLP) research due to its widespread use and unique linguistic characteristics. Normalising social media content, especially for morphologically rich languages like Arabic, remains a complex task due to limited parallel corpora. Arabic encompasses Modern Standard Arabic (MSA) and various regional dialects, collectively termed Dialectal Arabic (DA), which complicates NLP efforts due to their informal nature and variability. This paper presents Dial2MSA-Verified, an extension of the Dial2MSA dataset that includes verified translations for Gulf, Egyptian, Levantine, and Maghrebi dialects. We evaluate the performance of Seq2Seq models on this dataset, highlighting the effectiveness of state-of-the-art models in translating local Arabic dialects. We also provide insights through error analysis and outline future directions for enhancing Seq2Seq models and dataset development. The Dial2MSA-Verified dataset is publicly available to support further research.

pdf bib
Web-Based Corpus Compilation of the Emirati Arabic Dialect
Yousra A. El-Ghawi

This paper displays some initial efforts conducted in the compilation pursuits of Arabic dialectal corpora in the form of raw text, the end purpose of which is to fine-tune existing Arabic large language models (LLM) to better understand and generate text in the Emirati dialect as instructed. The focus of the paper is on the process of compiling corpora from the web, which includes the exploration of possible methods, tools and techniques specific to web search, as well as examples of genres and domains to explore. The results of these efforts and the importance of native speaker contributions to corpus compilation for low-resource languages are also touched upon.

pdf bib
Evaluating Calibration of Arabic Pre-trained Language Models on Dialectal Text
Ali Al-Laith | Rachida Kebdani

While pre-trained language models have made significant progress in different classification tasks, little attention has been given to the reliability of their confidence scores. Calibration, how well model confidence aligns with actual accuracy, is essential for real-world applications where decisions rely on probabilistic outputs. This study addresses this gap in Arabic dialect identification by assessing the calibration of eight pre-trained language models, ensuring their predictions are not only accurate but also reliable for practical applications. We analyze two datasets: one with over 1 million text samples and the Nuanced Arabic Dialect Identification dataset(NADI-2023). Using Expected Calibration Error (ECE) as a metric, we reveal substantial variation in model calibration across dialects in both datasets, showing that prediction confidence can vary significantly depending on regional data. This research has implications for improving the reliability of Arabic dialect models in applications like sentiment analysis and social media monitoring.

pdf bib
Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles
Azzedine Aftiss | Salima Lamsiyah | Christoph Schommer | Said Ouatik El Alaoui

Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks like summarization. Despite a growing volume of MD textual data online, there is a lack of robust resources and NLP models tailored to handle the unique linguistic challenges posed by MD. In response, we introduce .MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate the application of large language models (LLMs) for MD summarization, utilizing both fine-tuning and zero-shot prompting with encoder-decoder and causal LLMs, respectively. Our findings demonstrate that an expanded dataset improves summarization performance and highlights the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. We release the code at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.

pdf bib
Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
Salmane Chafik | Saad Ezzini | Ismail Berrada

The task of converting natural language questions into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source lan-guages, numerous borrowed words, and unique expressions. This demonstrates that our dataset will be a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.

pdf bib
AraSim: Optimizing Arabic Dialect Translation in Children’s Literature with LLMs and Similarity Scores
Alaa Hassan Bouomar | Noorhan Abbas

The goal of the paper is to address the linguistic gap faced by young Egyptian Arabic speakers through translating children stories from Modern Standard Arabic to the Egyptian Cairo dialect. Claude is used for initial translation, and a fine-tuned AraT5 model is used for backtranslation. The translation quality is assessed using semantic similarity and BLUE scores to compare the original texts and the translations. The resulting corpus contains 130 stories which were revised by native Egyptian speakers who are professional translators. The strengths of this paper are multiple: working on a less-resourced variety, addressing an important social issue, creating a dataset with potential real-life applications, and ensuring the quality of the produced dataset through human validation.

pdf bib
Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection
Ahmed Haj Ahmed | Rui-Jie Yew | Xerxes Minocher | Suresh Venkatasubramanian

Social media platforms have become central to global communication, yet they also facilitate the spread of hate speech. For underrepresented dialects like Levantine Arabic, detecting hate speech presents unique cultural, ethical, and linguistic challenges. This paper explores the complex sociopolitical and linguistic landscape of Levantine Arabic and critically examines the limitations of current datasets used in hate speech detection. We highlight the scarcity of publicly available, diverse datasets and analyze the consequences of dialectal bias within existing resources. By emphasizing the need for culturally and contextually informed natural language processing (NLP) tools, we advocate for a more nuanced and inclusive approach to hate speech detection in the Arab world.