Bashar Alhafni - ACL Anthology

Bashar Alhafni

2025

Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study
Bashar Alhafni | Nizar Habash
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text editing frames grammatical error correction (GEC) as a sequence tagging problem, where edit tags are assigned to input tokens, and applying these edits results in the corrected text. This approach has gained attention for its efficiency and interpretability. However, while extensively explored for English, text editing remains largely underexplored for morphologically rich languages like Arabic. In this paper, we introduce a text editing approach that derives edit tags directly from data, eliminating the need for language-specific edits. We demonstrate its effectiveness on Arabic, a diglossic and morphologically rich language, and investigate the impact of different edit representations on model performance. Our approach achieves SOTA results on two Arabic GEC benchmarks and performs on par with SOTA on two others. Additionally, our models are over six times faster than existing Arabic GEC systems, making our approach more practical for real-world applications. Finally, we explore ensemble models, demonstrating how combining different models leads to further performance improvements. We make our code, data, and pretrained models publicly available.

Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data
Chatrine Qwaider | Kirill Chirkunov | Bashar Alhafni | Nizar Habash | Ted Briscoe
Proceedings of The Third Arabic Natural Language Processing Conference

Prompt relevance is a critical yet underexplored dimension in Arabic Automated Essay Scoring (AES). We present the first systematic study of binary prompt-essay relevance classification, supporting both AES scoring and dataset annotation. To address data scarcity, we built a synthetic dataset of on-topic and off-topic pairs and evaluated multiple models, including threshold-based classifiers, SVMs, causal LLMs, and a fine-tuned masked SBERT model. For real-data evaluation, we combined QAES with ZAEBUC, creating off-topic pairs via mismatched prompts. We also tested prompt expansion strategies using AraVec, CAMeL, and GPT-4o. Our fine-tuned SBERT achieved 98% F1 on synthetic data and strong results on QAES+ZAEBUC, outperforming SVMs and threshold-based baselines and offering a resource-efficient alternative to LLMs. This work establishes the first benchmark for Arabic prompt relevance and provides practical strategies for low-resource AES.

BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Nasser Almatham | Kareem Mohamed Darwish | Raghad Al-Rasheed | Waad Thuwaini Alshammari | Muneera Alhoshan | Amal Almazrua | Asma Al Wazrah | Mais Alheraki | Firoj Alam | Preslav Nakov | Norah A. Alzahrani | Eman Albilali | Nizar Habash | Abdelrahman Mustafa El-Sheikh | Muhammad Elmallah | Hamdy Mubarak | Zaid Alyafeai | Mohamed Anwar | Haonan Li | Ahmed Abdelali | Nora Altwairesh | Maram Hasanain | Abdulmohsen Al-Thubaity | Shady Shehata | Bashar Alhafni | Injy Hamed | Go Inoue | Khalid N. Elmadani | Ossama Obeid | Fatima Haouari | Tamer Elsayed | Emad A. Alghamdi | Khalid Almubarak | Saied Alshahrani | Ola Aljareh | Safa Alajlan | Areej Alshaqarawi | Maryam Alshihri | Sultana Alghurabi | Atikah Alzeghayer | Afrah Altamimi | Abdullah Alfaifi | Abdulrahman M Alosaimy
Proceedings of The Third Arabic Natural Language Processing Conference

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

BAREC Shared Task 2025 on Arabic Readability Assessment
Khalid N. Elmadani | Bashar Alhafni | Hanada Taha | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present the results and findings of the BAREC Shared Task 2025 on Arabic Readability Assessment, organized as part of The Third Arabic Natural Language Processing Conference (ArabicNLP 2025). The BAREC 2025 shared task focuses on automatic readability assessment using the BAREC Corpus, addressing fine-grained classification into 19 readability levels. The shared task includes two sub-tasks, sentence-level classification and document-level classification, and three tracks: (1) Strict Track, where only the BAREC Corpus is allowed; (2) Constrained Track, restricted to the BAREC Corpus, SAMER Corpus, and SAMER Lexicon; and (3) Open Track, allowing any external resources. A total of 22 teams from 12 countries registered for the task. Among these, 17 teams submitted system description papers. The winning team achieved 87.5 QWK on the sentence-level task and 87.4 QWK on the document-level task.

Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
Ekaterina Kochmar | Bashar Alhafni | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anaïs Tack | Victoria Yaneva | Zheng Yuan
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection
Chatrine Qwaider | Bashar Alhafni | Kirill Chirkunov | Nizar Habash | Ted Briscoe
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Automated Essay Scoring (AES) plays a crucial role in assessing language learners’ writing quality, reducing grading workload, and providing real-time feedback. The lack of annotated essay datasets inhibits the development of Arabic AES systems. This paper leverages Large Language Models (LLMs) and Transformer models to generate synthetic Arabic essays for AES. We prompt an LLM to generate essays across the Common European Framework of Reference (CEFR) proficiency levels and introduce and compare two approaches to error injection. We create a dataset of 3,040 annotated essays with errors injected using our two methods. Additionally, we develop a BERT-based Arabic AES system calibrated to CEFR levels. Our experimental results demonstrate the effectiveness of our synthetic dataset in improving Arabic AES performance. We make our code and data publicly available.

ARWI: Arabic Write and Improve
Kirill Chirkunov | Bashar Alhafni | Chatrine Qwaider | Nizar Habash | Ted Briscoe
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment (https://arwi.mbzuai.ac.ae/). Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.

2024

Exploiting Dialect Identification in Automatic Dialectal Text Normalization
Bashar Alhafni | Sarah Al-Towaity | Ziyad Fawzy | Fatema Nassar | Fadhl Eryani | Houda Bouamor | Nizar Habash
Proceedings of the Second Arabic Natural Language Processing Conference

Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.

Strategies for Arabic Readability Modeling
Juan Liberato | Bashar Alhafni | Muhamed Khalil | Nizar Habash
Proceedings of the Second Arabic Natural Language Processing Conference

Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic’s morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.

The SAMER Arabic Text Simplification Corpus
Bashar Alhafni | Reem Hazim | Juan David Pineros Liberato | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

mEdIT: Multilingual Text Editing via Instruction Tuning
Vipul Raheja | Dimitris Alikaniotis | Vivek Kulkarni | Bashar Alhafni | Dhruv Kumar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce mEdIT, a multi-lingual extension to CoEdIT – the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as “Grammatik korrigieren” (German) or “이 텍스트를 단순화” (Korean). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models.

Personalized Text Generation with Fine-Grained Linguistic Control
Bashar Alhafni | Vivek Kulkarni | Dhruv Kumar | Vipul Raheja
Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024)

As the text generation capabilities of large language models become increasingly prominent, recent studies have focused on controlling particular aspects of the generated text to make it more personalized. However, most research on controllable text generation focuses on controlling the content or modeling specific high-level/coarse-grained attributes that reflect authors’ writing styles, such as formality, domain, or sentiment. In this paper, we focus on controlling fine-grained attributes spanning multiple linguistic dimensions, such as lexical and syntactic attributes. We introduce a novel benchmark to train generative models and evaluate their ability to generate personalized text based on multiple fine-grained linguistic attributes. We systematically investigate the performance of various large language models on our benchmark and draw insights from the factors that impact their performance. We make our code, data, models, and benchmarks publicly available.

2023

Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation
Bashar Alhafni | Go Inoue | Christian Khairallah | Nizar Habash
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models. We also define the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve SOTA results on two Arabic GEC shared task datasets and establish a strong benchmark on a recently created dataset. We make our code, data, and pretrained models publicly available.

The User-Aware Arabic Gender Rewriter
Bashar Alhafni | Ossama Obeid | Nizar Habash
Proceedings of the First Workshop on Gender-Inclusive Translation Technologies

We introduce the User-Aware Arabic Gender Rewriter, a user-centric web-based system for Arabic gender rewriting in contexts involving two users. The system takes either Arabic or English sentences as input, and provides users with the ability to specify their desired first and/or second person target genders. The system outputs gender rewritten alternatives of the Arabic sentences (provided directly or as translation outputs) to match the target users’ gender preferences.

2022

Arabic Word-level Readability Visualization for Assisted Text Simplification
Reem Hazim | Hind Saddiki | Bashar Alhafni | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization. The add-on includes a lemmatization component that is connected to a five-level readability lexicon and Arabic WordNet-based substitution suggestions. The add-on can be used for assessing the reading difficulty of a text and identifying difficult words as part of the task of manual text simplification. We make our add-on and its code publicly available.

CrisisLTLSum: A Benchmark for Local Crisis Event Timeline Extraction and Summarization
Hossein Rajaby Faghihi | Bashar Alhafni | Ke Zhang | Shihao Ran | Joel Tetreault | Alejandro Jaimes
Findings of the Association for Computational Linguistics: EMNLP 2022

Social media has increasingly played a key role in emergency response: first responders can use public posts to better react to ongoing crisis events and deploy the necessary resources where they are most needed. Timeline extraction and abstractive summarization are critical technical tasks to leverage large numbers of social media posts about events. Unfortunately, there are few datasets for benchmarking technical approaches for those tasks. This paper presents CrisisLTLSum, the largest dataset of local crisis event timelines available to date. CrisisLTLSum contains 1,000 crisis event timelines across four domains: wildfires, local fires, traffic, and storms. We built CrisisLTLSum using a semi-automated cluster-then-refine approach to collect data from the public Twitter stream. Our initial experiments indicate a significant gap between the performance of strong baselines and human performance on both tasks. Our dataset, code, and models are publicly available (https://github.com/CrisisLTLSum/CrisisTimelines).

Zero-shot Cross-Linguistic Learning of Event Semantics
Malihe Alikhani | Thomas Kober | Bashar Alhafni | Yue Chen | Mert Inan | Elizabeth Nielsen | Shahab Raji | Mark Steedman | Matthew Stone
Proceedings of the 15th International Conference on Natural Language Generation

The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced, and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output. This corpus expands on Habash et al. (2019)’s Arabic Parallel Gender Corpus (APGC v1.0) by adding second person targets as well as increasing the total number of sentences over 6.5 times, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.

User-Centric Gender Rewriting
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we define the task of gender rewriting in contexts involving two users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. We develop a multi-step system that combines the positive aspects of both rule-based and neural rewriting models. Our results successfully demonstrate the viability of this approach on a recently created corpus for Arabic gender rewriting, achieving 88.42 M2 F0.5 on a blind test set. Our proposed system improves over previous work on the first-person-only version of this task, by 3.05 absolute increase in M2 F0.5. We demonstrate a use case of our gender rewriting system by using it to post-edit the output of a commercial MT system to provide personalized outputs based on the users’ grammatical gender preferences. We make our code, data, and pretrained models publicly available.

The Shared Task on Gender Rewriting
Bashar Alhafni | Nizar Habash | Houda Bouamor | Ossama Obeid | Sultan Alrowili | Daliyah Alzeer | Khawlah M. Alshanqiti | Ahmed ElBakry | Muhammad ElNokrashy | Mohamed Gabr | Abderrahmane Issam | Abdelrahim Qaddoumi | K. Vijay-Shanker | Mahmoud Zyate
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., a female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.

2021

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Go Inoue | Bashar Alhafni | Nurpeiis Baimukan | Houda Bouamor | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.

2020

Gender-Aware Reinflection using Linguistically Enhanced Neural Models
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

In this paper, we present an approach for sentence-level gender reinflection using linguistically enhanced sequence-to-sequence models. Our system takes an Arabic sentence and a given target gender as input and generates a gender-reinflected sentence based on the target gender. We formulate the problem as a user-aware grammatical error correction task and build an encoder-decoder architecture to jointly model reinflection for both masculine and feminine grammatical genders. We also show that adding linguistic features to our model leads to better reinflection results. The results on a blind test set using our best system show improvements over previous work, with a 3.6% absolute increase in M2 F0.5.

CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing
Ossama Obeid | Nasser Zalmout | Salam Khalifa | Dima Taji | Mai Oudah | Bashar Alhafni | Go Inoue | Fadhl Eryani | Alexander Erdmann | Nizar Habash
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.

Co-authors

Kirill Chirkunov 3

Chatrine Qwaider 3

Ahmed Abdelali 2

Muhamed Al-Khalil 2

Khalid N. Elmadani 2

Salam Khalifa 2

Vivek Kulkarni 2

Ibrahim Abu Farha 1

Asma Al Wazrah 1

Raghad Al-Rasheed 1

Abdulmohsen Al-Thubaity 1

Sarah Al-Towaity 1

Eman Albilali 1

Abdullah Alfaifi 1

Emad A. Alghamdi 1

Sultana Alghurabi 1

Mais Alheraki 1

Muneera Alhoshan 1

Dimitris Alikaniotis 1

Malihe Alikhani 1

Badr Alkhamissi 1

Rawan Almatham 1

Rawan Nasser Almatham 1

Amal Almazrua 1

Khalid Almubarak 1

Abdulrahman M Alosaimy 1

Sultan Alrowili 1

Saied Alshahrani 1

Waad Thuwaini Alshammari 1

Khawlah M. Alshanqiti 1

Areej Alshaqarawi 1

Maryam Alshihri 1

Afrah Altamimi 1

Nora Altwairesh 1

Zaid Alyafeai 1

Norah A. Alzahrani 1

Daliyah Alzeer 1

Atikah Alzeghayer 1

Wissam Antoun 1

Mohamed Anwar 1

Nurpeiis Baimukan 1

Jill Burstein 1

Kareem Mohamed Darwish 1

Abdelrahman Mustafa El-Sheikh 1

Muhammad N. ElNokrashy 1

Ahmed Elbakry 1

Muhammad Elmallah 1

Tamer Elsayed 1

Alexander Erdmann 1

Ramy Eskander 1

Fatima Haouari 1

Maram Hasanain 1

Andrea Horbach 1

Abderrahmane Issam 1

Alejandro Jaimes 1

Christian Khairallah 1

Muhamed Khalil 1

Ekaterina Kochmar 1

Ronja Laarmann-Quante 1

Juan Liberato 1

Hamdy Mubarak 1

Preslav Nakov 1

Fatema Nassar 1

Elizabeth Nielsen 1

Yaser Onaizan 1

Juan David Pineros Liberato 1

Abdelrahim Qaddoumi 1

Hossein Rajaby Faghihi 1

Shady Shehata 1

Mark Steedman 1

Matthew Stone 1

Hanada Taha-Thomure 1

Joel Tetreault 1

Samia Touileb 1

K. Vijay-Shanker 1

Victoria Yaneva 1

Nasser Zalmout 1

Mahmoud Zyate 1

Venues