Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script

Mo El-Haj (Editor)


Anthology ID: 2025.abjadnlp-1
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Venues: AbjadNLP | WS
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2025.abjadnlp-1/
PDF: https://aclanthology.org/2025.abjadnlp-1.pdf

The Best of Both Worlds: Exploring Wolofal in the Context of NLP
Ngoc Tan Le | Ali Mijiyawa | Abdoulahat Leye | Fatiha Sadat

This paper examines the three writing systems used for the Wolof language: the Latin script, the Ajami script (Wolofal), and the Garay script. Although the Latin alphabet is now the official standard for writing Wolof in Senegal, Garay and Ajami still play an important cultural and religious role, especially the latter. This article focuses specifically on Ajami, a system based on the Arabic script, and describes its history, its use, and its modern writings. We also analyze the challenges and prospects of these systems from the perspective of language preservation.

MultiProp Framework: Ensemble Models for Enhanced Cross-Lingual Propaganda Detection in Social Media and News using Data Augmentation, Text Segmentation, and Meta-Learning
Farizeh Aldabbas | Shaina Ashraf | Rafet Sifa | Lucie Flek

Propaganda, a pervasive tool for influencing public opinion, demands robust automated detection systems, particularly for under-resourced languages. Current efforts largely focus on well-resourced languages like English, leaving significant gaps in languages such as Arabic. This research addresses these gaps by introducing the MultiProp Framework, a cross-lingual meta-learning framework designed to enhance propaganda detection across multiple languages, including Arabic, German, Italian, French, and English. We constructed a multilingual dataset using data translation techniques, beginning with Arabic data from the PTC and WANLP shared tasks, and expanded it with translations into German, Italian, and French, further enriched by the SemEval23 dataset. Our proposed framework encompasses three distinct models: MultiProp-Baseline, which combines ensembles of pre-trained models such as GPT-2, mBART, and XLM-RoBERTa; MultiProp-ML, designed to handle languages with minimal or no training data by utilizing advanced meta-learning techniques; and MultiProp-Chunk, which overcomes the challenges of processing longer texts that exceed the token limits of pre-trained models. Together, they deliver superior performance compared to state-of-the-art methods, representing a significant advancement in the field of cross-lingual propaganda detection.
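
To make the MultiProp-Chunk idea concrete, here is a minimal sketch of one common way to classify texts longer than an encoder's token limit: split the input into overlapping windows and average the per-window logits. The checkpoint, window size, and stride are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch: chunked classification for texts exceeding an encoder's token limit.
# Checkpoint and hyperparameters are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # propaganda vs. non-propaganda
)

def classify_long_text(text, max_len=512, stride=128):
    # Overflowing tokens produce one row per overlapping window.
    enc = tokenizer(text, return_overflowing_tokens=True, truncation=True,
                    max_length=max_len, stride=stride,
                    return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(input_ids=enc["input_ids"],
                       attention_mask=enc["attention_mask"]).logits
    return logits.mean(dim=0).softmax(-1)  # average logits across chunks
```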

Towards Unified Processing of Perso-Arabic Scripts for ASR
Srihari Bandarupalli | Bhavana Akkiraju | Sri Charan Devarakonda | Harinie Sivaramasethu | Vamshiraghusimha Narasinga | Anil Vuppala

Automatic Speech Recognition (ASR) systems for morphologically complex languages like Urdu, Persian, and Arabic face unique challenges due to the intricacies of Perso-Arabic scripts. Conventional data processing methods often fall short in effectively handling these languages’ phonetic and morphological nuances. This paper introduces a unified data processing pipeline tailored specifically for Perso-Arabic languages, addressing the complexities inherent in these scripts. The proposed pipeline encompasses comprehensive steps for data cleaning, tokenization, and phonemization, each of which has been meticulously evaluated and validated by expert linguists. Through expert-driven refinements, our pipeline provides a robust foundation for advancing ASR performance across Perso-Arabic languages, supporting the development of more accurate and linguistically informed multilingual ASR systems in the future.
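
As an illustration of the kind of cleaning step such a pipeline includes, the sketch below normalizes visually identical Perso-Arabic code points and strips diacritics. The mappings are assumptions for illustration, not the authors' actual rules (the Teh Marbuta mapping, in particular, is language-dependent).

```python
# Sketch: Perso-Arabic text cleaning via code-point unification and
# diacritic removal. The mapping table is an illustrative assumption.
import re
import unicodedata

CHAR_MAP = {
    "\u064A": "\u06CC",  # Arabic Yeh  -> Farsi Yeh
    "\u0643": "\u06A9",  # Arabic Kaf  -> Keheh
    "\u0629": "\u0647",  # Teh Marbuta -> Heh (language-dependent assumption)
}
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanwin, harakat, dagger alef

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    for src, dst in CHAR_MAP.items():
        text = text.replace(src, dst)
    return DIACRITICS.sub("", text)
```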

In-Depth Analysis of Arabic-Origin Words in the Turkish Morpholex
Mounes Zaval | Abdullah İhsanoğlu | Asım Ersoy | Olcay Taner Yıldız

MorphoLex is a lexical resource that analyzes the roots, prefixes, and suffixes of words; the Turkish Morpholex, for example, analyzes 48,472 Turkish words. Unfortunately, it lacks an in-depth analysis of Arabic-origin words and does not include their accurate roots. This study analyzes the Arabic-origin words in the Turkish Morpholex, annotating their roots, morphological patterns, and semantic categories. The methodology developed for this work is adaptable to other languages influenced by Arabic, such as Urdu and Persian, offering broader implications for studying loanword integration across linguistic contexts.
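
A record along these lines could hold one such annotation; the field names and values below are invented for illustration and are not drawn from the resource itself.

```python
# Sketch: one hypothetical annotation record for an Arabic-origin Turkish word.
# Field names and values are illustrative assumptions, not Morpholex entries.
entry = {
    "word": "kitap",           # Turkish, borrowed from Arabic كتاب (kitāb)
    "root": "ك ت ب",           # triliteral root k-t-b
    "pattern": "fiʿāl",        # the morphological pattern of kitāb
    "semantic_category": "written artifact",
}
```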

DadmaTools V2: an Adapter-Based Natural Language Processing Toolkit for the Persian Language
Sadegh Jafari | Farhan Farsi | Navid Ebrahimi | Mohamad Bagher Sajadi | Sauleh Eetemadi

DadmaTools V2 is a comprehensive repository designed to enhance NLP capabilities for the Persian language, catering to industry practitioners seeking practical and efficient solutions. The toolkit provides extensive code examples demonstrating the integration of its models with popular NLP frameworks such as Trankit and Transformers, as well as deep learning frameworks like PyTorch. Additionally, DadmaTools supports widely used Persian embeddings and datasets, ensuring robust language processing capabilities. The latest version of DadmaTools introduces an adapter-based technique, significantly reducing memory usage by employing a shared pre-trained model across various tasks, supplemented with task-specific adapter layers. This approach eliminates the need to maintain multiple pre-trained models and optimizes resource utilization. Enhancements in this version include new modules such as a sentiment detector, an informal-to-formal text converter, and a spell checker, further expanding the toolkit’s functionality. DadmaTools V2 thus represents a powerful, efficient, and versatile resource for advancing Persian NLP applications.
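
The bottleneck adapter below is a minimal PyTorch sketch of the general technique the toolkit describes: a small residual module trained per task on top of a frozen shared encoder. The dimensions are illustrative, and the toolkit's actual adapter architecture may differ.

```python
# Sketch: a bottleneck adapter layer. Each task trains only these few
# parameters while the shared pre-trained encoder stays frozen.
# Sizes are illustrative assumptions.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection: the adapter learns a small task-specific delta.
        return x + self.up(self.act(self.down(x)))
```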

Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles
Vahide Tajalli | Mehrnoush Shamsfard | Fateme Kalantari

Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails, and text messages. In informal writing, the language undergoes lexical and/or syntactic changes that vary among languages. Persian is one of the languages with many differences between its formal and informal styles of writing, thus developing informal language processing tools for this language seems necessary. The present paper describes the methodology for building a parallel corpus of 50,000 sentence pairs aligned at the word/phrase level. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs. The observed differences between formal and informal writing are explained in detail.
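
A dictionary of word/phrase pairs like the one described can back a simple lookup-based informal-to-formal converter, sketched below with invented example pairs; the actual corpus pairs and any context-sensitive rules are not reproduced here.

```python
# Sketch: naive informal-to-formal Persian conversion via dictionary lookup.
# The example pairs are illustrative assumptions, not corpus entries.
informal2formal = {
    "میخوام": "می‌خواهم",  # "I want" (informal -> formal)
    "میرم": "می‌روم",      # "I go"   (informal -> formal)
}

def formalize(sentence: str) -> str:
    # Replace each known informal token; leave unknown tokens unchanged.
    return " ".join(informal2formal.get(tok, tok) for tok in sentence.split())
```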

Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method
Masoumeh Mohammadi | Mohammad Ruhul Amin | Shadi Tavakoli

This paper presents a novel Sentiment Analysis (SA) dataset for the low-resource Persian language, together with a data augmentation technique that uses Generative Adversarial Networks (GANs) to generate synthetic data, boosting the volume and variety of the data to achieve state-of-the-art performance. We propose a novel annotated SA dataset, called Senti-Persian, made of 67,743 public comments on movie reviews from Iranian websites (Namava, Filimo and Aparat) and social media (YouTube, Twitter and Instagram). These reviews are labeled with one of the polarity labels positive, negative, or neutral. Our study includes a novel text augmentation model based on GANs. The generator was designed following the linguistic properties of Persian, while the discriminator was designed based on the cosine similarity of the vectorized original and generated sentences, i.e., using the CLS embeddings of BERT. An SA task was applied to both the collected and augmented datasets, for which we observed a significant improvement in accuracy, from 88.4% on the original dataset to 96% when augmented with synthetic data. The Senti-Persian dataset, including both the original and augmented data, will be available on GitHub.
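
The discriminator signal described above can be sketched as the cosine similarity between the [CLS] embeddings of an original and a generated sentence. The ParsBERT checkpoint here is an assumption, since the paper's exact encoder is not named in the abstract.

```python
# Sketch: cosine similarity of BERT [CLS] embeddings as a discriminator signal.
# The checkpoint is an assumed Persian BERT; the paper's model may differ.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")
bert = AutoModel.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

def cls_similarity(original: str, generated: str) -> float:
    batch = tok([original, generated], return_tensors="pt",
                padding=True, truncation=True)
    with torch.no_grad():
        cls = bert(**batch).last_hidden_state[:, 0]  # [CLS] vectors
    return torch.cosine_similarity(cls[0], cls[1], dim=0).item()
```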

Psychological Health Chatbot, Detecting and Assisting Patients in their Path to Recovery
Sadegh Jafari | Mohammad Erfan Zare | Amireza Vishte | Mirzae Melike | Zahra Amiri | Sima Mohammadparast | Sauleh Eetemadi

Mental health disorders such as stress, anxiety, and depression are increasingly prevalent globally, yet access to care remains limited due to barriers like geographic isolation, financial constraints, and stigma. Conversational agents or chatbots have emerged as viable digital tools for personalized mental health support. This paper presents the development of a psychological health chatbot designed specifically for Persian-speaking individuals, offering a culturally sensitive tool for emotion detection and disorder identification. The chatbot integrates several advanced natural language processing (NLP) modules, leveraging the ArmanEmo dataset to identify emotions, assess psychological states, and ensure safe, appropriate responses. Our evaluation of various models, including ParsBERT and XLM-RoBERTa, demonstrates effective emotion detection with accuracy up to 75.39%. Additionally, the system incorporates a Large Language Model (LLM) to generate messages. This chatbot serves as a promising solution for addressing the accessibility gap in mental health care and provides a scalable, language-inclusive platform for psychological support.
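
A minimal sketch of the emotion-detection step, assuming a fine-tuned Persian encoder; the checkpoint name is a hypothetical placeholder, not the authors' released model.

```python
# Sketch: emotion classification for Persian user messages.
# "your-org/parsbert-armanemo" is a hypothetical checkpoint name.
from transformers import pipeline

emotion_clf = pipeline("text-classification", model="your-org/parsbert-armanemo")
print(emotion_clf("امروز خیلی خوشحالم"))  # "I am very happy today"
```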

A Derivational ChainBank for Modern Standard Arabic
Reham Marzouk | Sondos Krouna | Nizar Habash

We introduce the new concept of an Arabic Derivational Chain Bank (CHAINBANK) to leverage the relationship between form and meaning in modeling Arabic derivational morphology. We constructed a knowledge graph network of abstract patterns and their derivational relations, and aligned it with the lemmas of the CAMELMORPH morphological analyzer database. This process produced chains of derived words’ lemmas linked to their corresponding lemma bases through derivational relations, encompassing 23,333 derivational connections. The CHAINBANK is publicly available.
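
The chain structure can be pictured as a directed graph from derived lemmas to their bases, as in the sketch below; the lemmas and relation labels are invented examples, not entries from the CHAINBANK.

```python
# Sketch: derivational chains as a directed graph (derived lemma -> base lemma).
# Lemmas and relation labels are illustrative assumptions.
import networkx as nx

chain = nx.DiGraph()
chain.add_edge("كِتَاب", "كَتَبَ", relation="verbal-noun-of")   # kitāb  <- kataba
chain.add_edge("مَكْتَب", "كَتَبَ", relation="noun-of-place")   # maktab <- kataba

# All lemmas derived (directly or transitively) from the base كَتَبَ:
print(nx.ancestors(chain, "كَتَبَ"))
```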

Sentiment Analysis of Arabic Tweets Using Large Language Models
Pankaj Dadure | Ananya Dixit | Kunal Tewatia | Nandini Paliwal | Anshika Malla

In the digital era, sentiment analysis has become an indispensable tool for understanding public sentiments, optimizing market strategies, and enhancing customer engagement across diverse sectors. While significant advancements have been made in sentiment analysis for high-resource languages such as English and French, low-resource languages have received far less attention. This study focuses on Arabic, a low-resource language, to address its unique challenges, such as morphological complexity, diverse dialects, and limited linguistic resources. Existing works in Arabic sentiment analysis have utilized deep learning architectures like LSTM, BiLSTM, and CNN-LSTM, alongside embedding techniques such as Word2Vec and contextualized models like ARABERT. Building on this foundation, our research investigates sentiment classification of Arabic tweets, categorizing them as positive or negative, using embeddings derived from three large language models (LLMs): Universal Sentence Encoder (USE), XLM-RoBERTa base (XLM-R base), and MiniLM-L12-v2. Experimental results demonstrate that incorporating emojis in the dataset and using the MiniLM embeddings yield an accuracy of 85.98%. In contrast, excluding emojis and using embeddings from the XLM-R base resulted in a lower accuracy of 78.98%. These findings highlight the impact of both dataset composition and embedding techniques on Arabic sentiment analysis performance.
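
A hedged sketch of the embedding-plus-classifier setup: encode tweets with a MiniLM-L12-v2 variant via sentence-transformers and fit a simple classifier. The multilingual checkpoint and the classifier choice are assumptions, not the authors' exact configuration.

```python
# Sketch: sentence embeddings + a linear classifier for tweet sentiment.
# Checkpoint and classifier are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

tweets = ["...positive Arabic tweet...", "...negative Arabic tweet..."]  # placeholders
labels = [1, 0]                                                          # 1 = positive

X = encoder.encode(tweets)                       # one vector per tweet
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```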

Evaluating Large Language Models on Health-Related Claims Across Arabic Dialects
Abdulsalam obaid Alharbi | Abdullah Alsuhaibani | Abdulrahman Abdullah Alalawi | Usman Naseem | Shoaib Jameel | Salil Kanhere | Imran Razzak

While Large Language Models (LLMs) have become popular across a range of tasks, their capability to handle health-related claims in diverse linguistic and cultural contexts, such as the Saudi, Egyptian, Lebanese, and Moroccan Arabic dialects, has not been thoroughly explored. To this end, we develop a comprehensive evaluation framework to assess how LLMs, particularly GPT-4, respond to health-related claims. Our framework focuses on measuring factual accuracy, consistency, and cultural adaptability. It introduces a new metric, the “Cultural Sensitivity Score”, to evaluate the model’s ability to adjust responses based on dialectal differences. Additionally, the reasoning patterns used by the models are analyzed to assess their effectiveness in engaging with claims across these dialects. Our findings highlight that while LLMs excel in recognizing true claims, they encounter difficulties with mixed and ambiguous claims, especially in underrepresented dialects. This work underscores the importance of dialect-specific evaluations to ensure accurate, contextually appropriate, and culturally sensitive responses from LLMs in real-world applications.

Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs
Ayushman Gupta | Aryan Singhal | Thomas Law | Veekshith Rao | Evan Duan | Ryan Luo Li

Large language models (LLMs) have demonstrated potential in fact-checking claims, yet their capabilities in verifying claims in multilingual contexts remain largely understudied. This paper investigates the efficacy of various prompting techniques, viz. Zero-Shot, English Chain-of-Thought, Self-Consistency, and Cross-Lingual Prompting, in enhancing the fact-checking and claim-verification abilities of LLMs for Arabic claims. We utilize 771 Arabic claims sourced from the X-fact dataset to benchmark the performance of four LLMs. To the best of our knowledge, ours is the first study to benchmark the inherent Arabic fact-checking abilities of LLMs stemming from their knowledge of Arabic facts, using a variety of prompting methods. Our results reveal significant variations in accuracy across different prompting methods. Our findings suggest that Cross-Lingual Prompting outperforms other methods, leading to notable performance gains.
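
Cross-Lingual Prompting, as commonly formulated, asks the model to move its reasoning into a high-resource language before answering; the template below is an illustrative assumption, not the paper's exact prompt.

```python
# Sketch: a cross-lingual prompt for verifying an Arabic claim in English.
# The template wording is an assumption for illustration.
CLP_TEMPLATE = """You are a fact-checking assistant.
Claim (Arabic): {claim}

Step 1: Translate the claim into English.
Step 2: Reason step by step in English about whether the claim is true.
Step 3: Answer with exactly one label: TRUE or FALSE.
"""

def build_prompt(claim: str) -> str:
    return CLP_TEMPLATE.format(claim=claim)
```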

Can LLMs Translate Cultural Nuance in Dialects? A Case Study on Lebanese Arabic
Silvana Yakhni | Ali Chehab

Machine Translation (MT) of Arabic-script languages presents unique challenges due to their vast linguistic diversity and lack of standardization. This paper focuses on the Lebanese dialect, investigating the effectiveness of Large Language Models (LLMs) in handling culturally-aware translations. We identify critical limitations in existing Lebanese-English parallel datasets, particularly their non-native nature and lack of cultural context. To address these gaps, we introduce a new culturally-rich dataset derived from the Language Wave (LW) podcast. We evaluate the performance of four LLMs (Jais, AceGPT, Cohere, and GPT-4) against two Neural Machine Translation (NMT) systems (NLLB-200 and Google Translate). Our findings reveal that while both architectures perform similarly on non-native datasets, LLMs demonstrate superior capabilities in preserving cultural nuances when handling authentic Lebanese content. Additionally, we validate xCOMET as a reliable metric for evaluating the quality of Arabic dialect translation, showing a strong correlation with human judgment. This work contributes to the growing field of Culturally-Aware Machine Translation and highlights the importance of authentic, culturally representative datasets in advancing low-resource translation systems.
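
For the metric side, xCOMET can be run through the unbabel-comet package roughly as sketched below; the checkpoint name and batch size are assumptions, and the paper's evaluation protocol may differ.

```python
# Sketch: scoring translations with an xCOMET checkpoint via unbabel-comet.
# Checkpoint and batch size are illustrative assumptions.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

data = [
    # Lebanese source, system output, and English reference (placeholders).
    {"src": "...", "mt": "...", "ref": "..."},
]
scores = model.predict(data, batch_size=8).scores  # one score per segment
```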

Automated Generation of Arabic Verb Conjugations with Multilingual Urdu Translation: An NLP Approach
Haq Nawaz | Manal Elobaid | Ali Al-Laith | Saif Ullah

This paper presents a rule-based automated system for generating both Arabic verb conjugations and their corresponding Urdu translations. The system processes triliteral, non-weak Arabic roots across key tenses: Past Simple, Past Simple Negative, Present Simple, and Present Simple Negative. Addressing the challenges posed by Arabic morphology, our rule-based approach applies patterns and morphological rules to accurately produce verb conjugations, capturing essential grammatical variations in gender, number, and person. Simultaneously, the system generates Urdu translations using predefined patterns that are aligned with the grammatical nuances of Arabic, ensuring semantic consistency. As the first system of its kind, it uniquely provides a cross-lingual resource that bridges two linguistically similar but distinct languages. By focusing on rule-based precision and dual-language outputs, it addresses critical gaps in NLP resources, serving as a valuable tool for linguists, educators, and NLP researchers in academic and religious contexts where Arabic and Urdu coexist.
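
A toy sketch of the rule-based idea: attach person/gender/number suffixes to a faʿala-pattern stem built from a triliteral, non-weak root. The table covers only four Past Simple cells and omits the Urdu side entirely; the actual system's rules are far more complete.

```python
# Sketch: Past Simple conjugation of a triliteral, non-weak root via
# suffix rules. Coverage is deliberately tiny and illustrative.
PAST_SUFFIXES = {
    ("3", "m", "sg"): "\u064E",                    # kataba   (he wrote)
    ("3", "f", "sg"): "\u064E\u062A\u0652",        # katabat  (she wrote)
    ("2", "m", "sg"): "\u0652\u062A\u064E",        # katabta  (you, m., wrote)
    ("1", "-", "sg"): "\u0652\u062A\u064F",        # katabtu  (I wrote)
}

def past_simple(root: str, person: str, gender: str, number: str) -> str:
    c1, c2, c3 = root                              # e.g. "كتب" (k-t-b)
    stem = c1 + "\u064E" + c2 + "\u064E" + c3      # faʿala pattern: CaCaC
    return stem + PAST_SUFFIXES[(person, gender, number)]

print(past_simple("كتب", "3", "m", "sg"))          # كَتَبَ
```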

Evaluation of Large Language Models on Arabic Punctuation Prediction
Asma Ali Al Wazrah | Afrah Altamimi | Hawra Aljasim | Waad Alshammari | Rawan Al-Matham | Omar Elnashar | Mohamed Amin | Abdulrahman AlOsaimy

The linguistic inclusivity of Large Language Models (LLMs) such as ChatGPT, Gemini, JAIS, and AceGPT has not been sufficiently explored, particularly in their handling of low-resource languages like Arabic compared to English. While these models have shown impressive performance across various tasks, their effectiveness in Arabic remains under-examined. Punctuation, critical for sentence structure and comprehension in tasks like speech analysis, synthesis, and machine translation, requires precise prediction. This paper assesses seven LLMs (GPT-4o, Gemini 1.5, JAIS, AceGPT, SILMA, ALLaM, and Command R+) for Arabic punctuation prediction. Additionally, the performance of fine-tuned AraBERT is compared with these models in zero-shot and few-shot settings using a proposed Arabic punctuation prediction corpus of 10,044 sentences. The experiments demonstrate that while AraBERT performs well for specific punctuation marks, LLMs show significant promise in zero-shot learning, with further improvements in few-shot scenarios. These findings highlight the potential of LLMs to enhance the automation and accuracy of Arabic text processing.
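
A hedged sketch of what a zero-shot (and, with examples prepended, few-shot) punctuation-restoration prompt for this task might look like; the paper's exact instruction wording and format are not reproduced here.

```python
# Sketch: zero-/few-shot prompt construction for Arabic punctuation prediction.
# The instruction text and few-shot format are illustrative assumptions.
ZERO_SHOT = (
    "أعد كتابة النص التالي مع إضافة علامات الترقيم المناسبة دون تغيير الكلمات:\n"
    "{text}"
)

def punctuation_prompt(text: str, examples=()):
    # Few-shot setting: prepend (unpunctuated, punctuated) example pairs.
    shots = "\n\n".join(f"النص: {u}\nالنتيجة: {p}" for u, p in examples)
    return (shots + "\n\n" if shots else "") + ZERO_SHOT.format(text=text)
```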

Evaluating RAG Pipelines for Arabic Lexical Information Retrieval: A Comparative Study of Embedding and Generation Models
Raghad Al-Rasheed | Abdullah Al Muaddi | Hawra Aljasim | Rawan Al-Matham | Muneera Alhoshan | Asma Al Wazrah | Abdulrahman AlOsaimy

This paper investigates the effectiveness of retrieval-augmented generation (RAG) pipelines, focusing on Arabic lexical information retrieval. Specifically, it analyzes how embedding models affect the recall of Arabic lexical information and evaluates the ability of large language models (LLMs) to produce accurate and contextually relevant answers within the RAG pipelines. We examine a dataset of over 88,000 words from the Riyadh dictionary and evaluate the models using metrics such as Top-K Recall, Mean Reciprocal Rank (MRR), F1 Score, Cosine Similarity, and Accuracy. The research assesses the capabilities of several embedding models, including E5-large, BGE, AraBERT, CAMeLBERT, and AraELECTRA, highlighting a disparity in performance between sentence embeddings and word embeddings. Sentence embedding with E5 achieved the best results, with a Top-5 Recall of 0.88 and an MRR of 0.48. For the generation models, we evaluated GPT-4, GPT-3.5, SILMA-9B, Gemini-1.5, Aya-8B, and AceGPT-13B based on their ability to generate accurate and contextually appropriate responses. GPT-4 demonstrated the best performance, achieving an F1 score of 0.90, an accuracy of 0.82, and a cosine similarity of 0.87. Our results emphasize the strengths and limitations of both embedding and generation models in Arabic tasks.
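
Under the usual single-relevant-entry definitions, the two retrieval metrics named above can be computed as sketched below; the paper's exact protocol (e.g., how multiple relevant senses are handled) may differ.

```python
# Sketch: Top-K Recall and Mean Reciprocal Rank for retrieval evaluation,
# assuming exactly one relevant dictionary entry per query.
def top_k_recall(ranked_ids, gold_id, k=5):
    """1.0 if the relevant entry appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mrr(all_ranked, all_gold):
    """Mean of 1/rank of the relevant entry; 0 contribution if not retrieved."""
    total = 0.0
    for ranked_ids, gold_id in zip(all_ranked, all_gold):
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(all_gold)
```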