Hend Al-Khalifa

2026

From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted “Vibe Coding”
Hend Al-Khalifa
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)

The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces “Vibe Coding,” a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.

2025

iWAN-NLP at AHaSIS 2025: A Stacked Ensemble of Arabic Transformers for Sentiment Analysis on Arabic Dialects in the Hospitality Domain
Hend Al-Khalifa
Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects

This paper details the iWAN-NLP system developed for participation in the AHaSIS 2025 shared task, “Sentiment Analysis on Arabic Dialects in the Hospitality Domain: A Multi-Dialect Benchmark.” Our approach leverages a multi-model ensemble strategy, combining the strengths of MARBERTv2, Saudibert, and DarijaBERT. These pre-trained Arabic language models were fine-tuned for sentiment classification using a 5-fold stratified cross-validation methodology. The final predictions on the test set were derived by averaging the logits produced by each model across all folds and then averaging these combined logits across the three models. This system achieved a macro F1-score of 81.0% on the official evaluation dataset and a cross-validated macro F1-score of 0.8513 (accuracy 0.8628) on the training set. Our findings highlight the effectiveness of ensembling regionally adapted models and robust cross-validation for Arabic sentiment analysis in the hospitality domain, ultimately securing first place in the AHaSIS 2025 shared task.
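The two-stage averaging described in the abstract (across folds within each model, then across models) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the nested-list layout of the logits, and the toy numbers are all assumptions.

```python
from statistics import fmean


def average_logits(logit_tables):
    """Element-wise mean of several [n_examples][n_classes] logit tables."""
    n_examples = len(logit_tables[0])
    n_classes = len(logit_tables[0][0])
    return [
        [fmean(table[i][c] for table in logit_tables) for c in range(n_classes)]
        for i in range(n_examples)
    ]


def ensemble_predict(per_model_fold_logits):
    """Average over folds per model, then over models, then take the argmax.

    per_model_fold_logits maps a (hypothetical) model name to a list of
    per-fold logit tables, each shaped [n_examples][n_classes].
    """
    per_model = [average_logits(folds) for folds in per_model_fold_logits.values()]
    combined = average_logits(per_model)
    return [max(range(len(row)), key=row.__getitem__) for row in combined]


# Toy data: two models, two folds each, one test example, three classes.
toy_logits = {
    "model_a": [[[2.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]],
    "model_b": [[[0.0, 3.0, 0.0]], [[0.0, 3.0, 0.0]]],
}
predictions = ensemble_predict(toy_logits)
```

Averaging logits rather than hard votes lets a confident model outweigh an uncertain one, which is one common motivation for this style of ensembling.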

Proceedings of the 31st International Conference on Computational Linguistics
Owen Rambow | Leo Wanner | Marianna Apidianaki | Hend Al-Khalifa | Barbara Di Eugenio | Steven Schockaert
Proceedings of the 31st International Conference on Computational Linguistics

2024

CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais | Hend Al-Khalifa | Abdulmalik AlSalman
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Label errors are a common issue in machine learning datasets, particularly for tasks such as Named Entity Recognition. Such label errors might hurt model training, affect evaluation results, and lead to an inaccurate assessment of model performance. In this study, we examined one of the widely adopted Arabic NER benchmark datasets (ANERcorp) in depth and found a significant number of annotation errors, missing labels, and inconsistencies. We therefore conducted empirical research to understand these errors, correct them, and propose a cleaner version of the dataset named CLEANANERCorp. CLEANANERCorp will serve the research community as a more accurate and consistent benchmark.

Meta-Evaluation of Sentence Simplification Metrics
Noof Abdullah Alfear | Dimitar Kazakov | Hend Al-Khalifa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Automatic Text Simplification (ATS) is one of the major Natural Language Processing (NLP) tasks, which aims to help people understand text that is above their reading abilities and comprehension. ATS models reconstruct the text into a simpler format by deletion, substitution, addition or splitting, while preserving the original meaning and maintaining correct grammar. Simplified sentences are usually evaluated either by human experts, based on three main factors (simplicity, adequacy and fluency), or by calculating automatic evaluation metrics. In this paper, we conduct a meta-evaluation of reference-based automatic metrics for English sentence simplification using a high-quality, human-annotated dataset, NEWSELA-LIKERT. We study the behavior of several evaluation metrics at sentence level across four different sentence simplification models. All the models were trained on the NEWSELA-AUTO dataset. The correlation between the metrics’ scores and human judgements was analyzed, and the results were used to recommend the most appropriate metrics for this task.

Analyzing Politeness in Arabic Tweets: A Preliminary Study
Hend Al-Khalifa | Nadia Ghezaiel | Maria Bounnit
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

A Novel Approach for Root Selection in the Dependency Parsing
Sharefah Ahmed Al-Ghamdi | Hend Al-Khalifa | Abdulmalik AlSalman
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Although syntactic analysis using the sequence labeling method is promising, it can be problematic when the label sequence does not contain a root label. This can result in errors in the final parse tree when the postprocessing method assumes that the first word is the root. In this paper, we present a novel postprocessing method for BERT-based dependency parsing as sequence labeling. Our method leverages the root’s part of speech tag to select a more suitable root for the dependency tree, instead of using the default first token. We conducted experiments on nine dependency treebanks from different languages and domains, and demonstrated that our technique consistently improves the labeled attachment score (LAS) on most of them.

Halwasa: Quantify and Analyze Hallucinations in Large Language Models: Arabic as a Case Study
Hamdy Mubarak | Hend Al-Khalifa | Khaloud Suliman Alkhalefah
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have shown superb abilities to generate texts that are indistinguishable from human-generated texts in many cases. However, sometimes they generate false, incorrect, or misleading content, which is often described as “hallucinations”. Quantifying and analyzing hallucination in LLMs can increase their reliability and usage. While hallucination is being actively studied for English and other languages, and different benchmarking datasets have been created, this area has not been studied at all for Arabic. In our paper, we create the first Arabic dataset that contains 10K sentences generated by LLMs and annotate it for factuality and correctness. We provide a detailed analysis of the dataset's factual and linguistic errors. We found that 25% of the generated sentences are factually incorrect. We share the dataset with the research community.

The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
Shahad Al-Khalifa | Hend Al-Khalifa
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Hend Al-Khalifa | Kareem Darwish | Hamdy Mubarak | Mona Ali | Tamer Elsayed
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi | Omar Said Alshahri | Hend Al-Khalifa
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Many under-resourced languages lack computational resources for automatic speech recognition (ASR) due to data scarcity issues. This makes developing accurate ASR models challenging. Shehri or Jibbali, spoken in Oman, lacks extensive annotated speech data. This paper aims to improve an ASR model for this under-resourced language. We collected a Shehri (Jibbali) speech corpus and utilized transfer learning by fine-tuning pre-trained ASR models on this dataset. Specifically, models like Wav2Vec2.0, HuBERT and Whisper were fine-tuned using techniques like parameter-efficient fine-tuning. Evaluation using word error rate (WER) and character error rate (CER) showed that the Whisper model, fine-tuned on the Shehri (Jibbali) dataset, significantly outperformed other models, with the best results from Whisper-medium achieving 3.5% WER. This demonstrates the effectiveness of transfer learning for resource-constrained tasks, showing high zero-shot performance of pre-trained models.
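For readers unfamiliar with the two metrics named in the abstract, the sketch below shows one standard way to compute WER and CER from an edit distance. It is an illustrative implementation under simple assumptions (whitespace tokenization, no text normalization), not the evaluation code used in the paper; the function names and the English example strings are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the reference
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because both rates are normalized by the reference length, a hypothesis with many insertions can score above 1.0.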

2022

Customer Sentiments Toward Saudi Banks During the Covid-19 Pandemic
Dhuha Alqahtani | Lama Alzahrani | Maram Bahareth | Nora Alshameri | Hend Al-Khalifa | Luluh Aldhubayi
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Hend Al-Khalifa | Tamer Elsayed | Hamdy Mubarak | Abdulmohsen Al-Thubaity | Walid Magdy | Kareem Darwish
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

Overview of OSACT5 Shared Task on Arabic Offensive Language and Hate Speech Detection
Hamdy Mubarak | Hend Al-Khalifa | Abdulmohsen Al-Thubaity
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

This paper provides an overview of the shared task on detecting offensive language, hate speech, and fine-grained hate speech at the fifth workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5). The shared task comprised three subtasks: Subtask A, involving the detection of offensive language, which contains socially unacceptable or impolite content including any kind of explicit or implicit insults or attacks against individuals or groups; Subtask B, involving the detection of hate speech, which contains offensive language targeting individuals or groups based on common characteristics such as race, religion, gender, etc.; and Subtask C, involving the detection of the fine-grained type of hate speech, which takes one value from the following types: (i) race/ethnicity/nationality, (ii) religion/belief, (iii) ideology, (iv) disability/disease, (v) social class, and (vi) gender. In total, 40 teams signed up to participate in Subtask A, and 17 of them submitted test runs. For Subtask B, 26 teams signed up to participate and 12 of them submitted runs. For Subtask C, 23 teams signed up to participate and 10 of them submitted runs. Ten teams submitted papers describing their participation in one or more subtasks, and 8 papers were accepted. We present and analyze all submissions in this paper.

Establishing a Baseline for Arabic Patents Classification: A Comparison of Twelve Approaches
Taif Omar Al-Omar | Hend Al-Khalifa | Rawan Al-Matham
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

The number of patent applications is constantly growing, and there is an economic interest in developing accurate and fast models to automate their classification. In this paper, we introduce the first public Arabic patent dataset, called ArPatent, and experiment with twelve classification approaches to develop a baseline for Arabic patent classification. To find the best baseline, we evaluated different machine learning models, pre-trained language models, and ensemble approaches. The results show that the best performing model for classifying Arabic patents was ARBERT, with an F1 of 66.53%, while the ensemble of the three best performing language models, namely ARBERT, CAMeL-MSA, and QARiB, achieved the second best F1 score of 64.52%.

Sa‘7r: A Saudi Dialect Irony Dataset
Halah AlMazrua | Najla AlHazzani | Amaal AlDawod | Lama AlAwlaqi | Noura AlReshoudi | Hend Al-Khalifa | Luluh AlDhubayi
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

In sentiment analysis, detecting irony is considered a major challenge. The key problem with detecting irony is the difficulty of recognizing the implicit and indirect phrases which signify the opposite meaning. In this paper, we present Sa‘7r ساخر, the Saudi irony dataset, and describe our efforts in constructing it. The dataset was collected using the Twitter API and consists of 19,810 tweets, 8,089 of which are labeled as ironic. We trained several models for the irony detection task using machine learning and deep learning models. The machine learning models include K-Nearest Neighbor (KNN), Logistic Regression (LR), Support Vector Machine (SVM), and Naïve Bayes (NB), while the deep learning models include BiLSTM and AraBERT. The detection results show that among the tested machine learning models, the SVM outperformed other classifiers with an accuracy of 0.68. The deep learning models achieved an accuracy of 0.66 with the BiLSTM model and 0.71 with the AraBERT model. Thus, the AraBERT model achieved the most accurate result in detecting ironic phrases in the Saudi dialect.

Assessing the Linguistic Knowledge in Arabic Pre-trained Language Models Using Minimal Pairs
Wafa Abdullah Alrajhi | Hend Al-Khalifa | Abdulmalik AlSalman
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Despite the noticeable progress that we recently witnessed in Arabic pre-trained language models (PLMs), the linguistic knowledge captured by these models remains unclear. In this paper, we conducted a study to evaluate available Arabic PLMs in terms of their linguistic knowledge. BERT-based language models (LMs) are evaluated using Minimal Pairs (MPs), where each pair represents a grammatical sentence and its contradictory counterpart. MPs isolate specific linguistic knowledge to test the model’s sensitivity in understanding a specific linguistic phenomenon. We cover nine major Arabic phenomena: Verbal sentences, Nominal sentences, Adjective Modification, and Idafa construction. The experiments compared the results of fifteen Arabic BERT-based PLMs. Overall, among all tested models, CAMeL-CA outperformed the other PLMs by achieving the highest overall accuracy.
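The pair-based evaluation described above is commonly scored as the fraction of pairs in which a model prefers the grammatical sentence. The sketch below assumes that convention; the abstract does not state the exact scoring rule, so the comparison, the function name, and the lookup-table scorer standing in for a real language model are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical sentence scores (e.g. pseudo-log-likelihoods) keyed by
# placeholder sentence IDs; a real setup would query a language model.
toy_scores = {"good_1": -10.2, "bad_1": -14.7, "good_2": -9.1, "bad_2": -8.5}
toy_pairs = [("good_1", "bad_1"), ("good_2", "bad_2")]

accuracy = minimal_pair_accuracy(toy_pairs, toy_scores.__getitem__)
```

Here the toy model prefers the grammatical sentence in the first pair only, so it is credited with one pair out of two.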

2021

Sarcasm and Sentiment Detection In Arabic Tweets Using BERT-based Models and Data Augmentation
Abeer Abuzayed | Hend Al-Khalifa
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we describe our efforts on the shared task of sarcasm and sentiment detection in Arabic (Abu Farha et al., 2021). The shared task consists of two sub-tasks: Sarcasm Detection (Subtask 1) and Sentiment Analysis (Subtask 2). Our experiments were based on fine-tuning seven BERT-based models with data augmentation to solve the imbalanced data problem. For both tasks, the BERT-based MARBERT model with data augmentation outperformed the other models, increasing the F-score by 15%, which shows the effectiveness of our approach.

A Dependency Treebank for Classical Arabic Poetry
Sharefah Al-Ghamdi | Hend Al-Khalifa | Abdulmalik Al-Salman
Proceedings of the Sixth International Conference on Dependency Linguistics (Depling, SyntaxFest 2021)

2020

Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Hend Al-Khalifa | Walid Magdy | Kareem Darwish | Tamer Elsayed | Hamdy Mubarak
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

Overview of OSACT4 Arabic Offensive Language Detection Shared Task
Hamdy Mubarak | Kareem Darwish | Walid Magdy | Tamer Elsayed | Hend Al-Khalifa
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

This paper provides an overview of the offensive language detection shared task at the 4th workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4). There were two subtasks, namely: Subtask A, involving the detection of offensive language, which contains unacceptable or vulgar content in addition to any kind of explicit or implicit insults or attacks against individuals or groups; and Subtask B, involving the detection of hate speech, which contains insults or threats targeting a group based on their nationality, ethnicity, race, gender, political or sport affiliation, religious belief, or other common characteristics. In total, 40 teams signed up to participate in Subtask A, and 14 of them submitted test runs. For Subtask B, 33 teams signed up to participate and 13 of them submitted runs. We present and analyze all submissions in this paper.

Hate Speech Detection in Saudi Twittersphere: A Deep Learning Approach
Raghad Alshaalan | Hend Al-Khalifa
Proceedings of the Fifth Arabic Natural Language Processing Workshop

With the rise of hate speech phenomena in the Twittersphere, significant research efforts have been undertaken to provide automatic solutions for detecting hate speech, varying from simple machine learning models to more complex deep neural network models. Despite that, research investigating the hate speech problem in Arabic is still limited. This paper, therefore, aims to investigate several neural network models based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to detect hate speech in Arabic tweets. It also evaluates the recent language representation model BERT on the task of Arabic hate speech detection. To conduct our experiments, we first built a new hate speech dataset that contains 9,316 annotated tweets. Then, we conducted a set of experiments on two datasets to evaluate four models: CNN, GRU, CNN+GRU and BERT. Our experimental results on our dataset and an out-domain dataset show that the CNN model gives the best performance, with an F1-score of 0.79 and AUROC of 0.89.

2017

2016

This paper introduces MADAD, a general-purpose annotation tool for Arabic text with a focus on readability annotation. This tool will help overcome the lack of Arabic readability training data by providing an online environment to collect readability assessments on various kinds of corpora. The tool also supports a broad range of annotation tasks for various linguistic and semantic phenomena by allowing users to create their own customized annotation schemes. MADAD is a web-based tool, accessible through any web browser; the main features that distinguish MADAD are its flexibility, portability, customizability and its bilingual interface (Arabic/English).

AraSenTi: Large-Scale Twitter-Specific Arabic Sentiment Lexicons
Nora Al-Twairesh | Hend Al-Khalifa | Abdulmalik Al-Salman
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
