2025
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script
Mo El-Haj (editor)
LENS: Learning Entities from Narratives of Skin Cancer
Daisy Monika Lal | Paul Rayson | Christopher Peter | Ignatius Ezeani | Mo El-Haj | Yafei Zhu | Yufeng Liu
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations
Learning Entities from Narratives of Skin Cancer (LENS) is an automatic entity recognition system built on colloquial writings from skin cancer-related Reddit forums. LENS covers a comprehensive set of 24 labels spanning clinical, demographic, and psychosocial aspects of skin cancer. We release LENS as a Python package on PyPI, installable via pip, making it easy for developers to download and use, and we also provide a web application that lets users obtain model predictions interactively, which is useful for researchers and individuals with minimal programming experience. Additionally, we publish annotation guidelines designed specifically for spontaneous skin cancer narratives, which can be adapted to better understand and address the challenges of developing corpora or systems for similar diseases. The model achieves an overall entity-level F1 score of 0.561, with notable performance for entities such as “CANC_T” (0.747), “STG” (0.788), “POB” (0.714), “GENDER” (0.750), “A/G” (0.714), and “PPL” (0.703). Other entities with significant results include “TRT” (0.625), “MED” (0.606), “AGE” (0.646), “EMO” (0.619), and “MHD” (0.500). We believe that LENS can serve as an essential tool for analysing patient discussions, leading to improvements in the design and development of modern smart healthcare technologies.
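For readers unfamiliar with the reported metric, entity-level F1 counts a prediction as correct only when both the span and the label exactly match a gold entity. A minimal illustrative sketch, not the LENS implementation (the spans and labels below are invented examples):

```python
from collections import Counter

def entity_f1(gold, pred):
    """Entity-level F1: a predicted (start, end, label) triple counts as a
    true positive only if it exactly matches a gold entity."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    tp = sum((gold_counts & pred_counts).values())  # exact span+label matches
    precision = tp / sum(pred_counts.values()) if pred else 0.0
    recall = tp / sum(gold_counts.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: two exact matches, one miss, one spurious prediction.
gold = [(0, 9, "CANC_T"), (15, 20, "STG"), (30, 35, "AGE")]
pred = [(0, 9, "CANC_T"), (15, 20, "STG"), (40, 45, "EMO")]
print(round(entity_f1(gold, pred), 3))  # → 0.667
```

Because partial overlaps score zero under this strict criterion, entity-level F1 is typically lower than token-level scores, which helps put the 0.561 overall figure in context.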
Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding?
Daisy Monika Lal | Paul Rayson | Mo El-Haj
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
In this study, we explore the performance of four advanced generative AI models (GPT-3.5, GPT-4, Llama3, and HindiGPT) on the Hindi reading comprehension task. Using a zero-shot, instruction-based prompting strategy, we assess model responses through a comprehensive triple evaluation framework on the HindiRC dataset. Our framework combines (1) automatic evaluation using ROUGE, BLEU, BLEURT, METEOR, and cosine similarity; (2) rating-based assessments focussing on correctness, comprehension depth, and informativeness; and (3) preference-based selection to identify the best responses. Human ratings indicate that GPT-4 outperforms the other LLMs on all parameters, followed by HindiGPT, GPT-3.5, and then Llama3. Preference-based evaluation similarly placed GPT-4 (80%) as the best model, followed by HindiGPT (74%). However, automatic evaluation showed GPT-4 to be the lowest performer on n-gram metrics yet the best performer on semantic metrics, suggesting that it captures deeper meaning and semantic alignment rather than direct lexical overlap, which aligns with its strong human evaluation scores. This study also highlights that although the models mostly answer literal factual-recall questions with high precision, they still struggle at times with specificity and interpretive bias.
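The split between GPT-4's weak n-gram scores and strong semantic scores comes down to how little lexical-overlap metrics reward paraphrase. A toy sketch of BLEU-style clipped unigram precision illustrates this (an invented example, not the paper's evaluation code):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """BLEU-1-style precision: fraction of candidate tokens that also
    appear in the reference, with counts clipped to the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    return overlap / sum(cand.values()) if candidate.split() else 0.0

reference = "the cat sat on the mat"
candidate = "a feline was sitting on a rug"  # same meaning, different words
print(round(unigram_precision(candidate, reference), 3))  # → 0.143
```

A fluent paraphrase like the one above scores near zero on lexical overlap even though a human (or an embedding-based metric such as BLEURT) would judge it adequate, which is consistent with the divergence reported for GPT-4.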
Proceedings of the First International Workshop on Nakba Narratives as Language Resources
Mustafa Jarrar | Nizar Habash | Mo El-Haj (editors)
The Nakba Lexicon: Building a Comprehensive Dataset from Palestinian Literature
Izza AbuHaija | Salim Al Mandhari | Mo El-Haj | Jonas Sibony | Paul Rayson
Proceedings of the First International Workshop on Nakba Narratives as Language Resources
This paper introduces the Nakba Lexicon, a comprehensive dataset derived from the poetry collection Asifa ‘Ala al-Iz‘aj (Sorry for the Disturbance) by Istiqlal Eid, a Palestinian poet from El-Birweh. Eid’s work poignantly reflects on themes of Palestinian identity, displacement, and resilience, serving as a resource for preserving linguistic and cultural heritage in the context of post-Nakba literature. The dataset is structured into ten thematic domains, including political terminology, memory and preservation, sensory and emotional lexicon, toponyms, nature, and external linguistic influences such as Hebrew, French, and English, thereby capturing the socio-political, emotional, and cultural dimensions of the Nakba. The Nakba Lexicon uniquely emphasises the contributions of women to Palestinian literary traditions, shedding light on often-overlooked narratives of resilience and cultural continuity. Advanced Natural Language Processing (NLP) techniques were employed to analyse the dataset, with fine-tuned pre-trained models such as AraBERT and MARBERT achieving F1-scores of 0.87 and 0.68 in language and lexical classification tasks, respectively, significantly outperforming traditional machine learning models. These results highlight the potential of domain-specific computational models to effectively analyse complex datasets, facilitating the preservation of marginalised voices. By bridging computational methods with cultural preservation, this study enhances the understanding of Palestinian linguistic heritage and contributes to broader efforts in documenting and analysing endangered narratives. The Nakba Lexicon paves the way for future interdisciplinary research, showcasing the role of NLP in addressing historical trauma, resilience, and cultural identity.
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Saad Ezzini | Hamza Alami | Ismail Berrada | Abdessamad Benlahbib | Abdelkader El Mahdaouy | Salima Lamsiyah | Hatim Derrouz | Amal Haddad Haddad | Mustafa Jarrar | Mo El-Haj | Ruslan Mitkov | Paul Rayson (editors)
2024
AraFinNLP 2024: The First Arabic Financial NLP Shared Task
Sanad Malaysha | Mo El-Haj | Saad Ezzini | Mohammed Khalilia | Mustafa Jarrar | Sultan Almujaiwel | Ismail Berrada | Houda Bouamor
Proceedings of The Second Arabic Natural Language Processing Conference
The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of 77 common intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chatbots. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in Subtask 1, while only one team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only submission to Subtask 2 achieved a BLEU score of 1.667.
DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj | Sultan Almujaiwel | Damith Premasiri | Tharindu Ranasinghe | Ruslan Mitkov
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and covers two subtasks: (a) coarse-grained readability assessment, where the text is classified into educational levels such as primary and secondary; and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
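The gap between the weighted and macro F1 scores reported above is typical of imbalanced label sets: weighted F1 is dominated by frequent classes, while macro F1 gives every class equal weight. A small self-contained illustration on invented toy data (not DARES):

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """One-vs-rest F1 for a single label."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label != t for t, p in zip(y_true, y_pred))
    fn = sum(t == label != p for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true))
    scores = {lab: f1_per_class(y_true, y_pred, lab) for lab in labels}
    support = Counter(y_true)
    macro = sum(scores.values()) / len(labels)          # unweighted mean
    weighted = sum(scores[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted

# Imbalanced toy data: the frequent class is easy, the rare class is harder.
y_true = ["primary"] * 8 + ["secondary"] * 2
y_pred = ["primary"] * 8 + ["primary", "secondary"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 3), round(weighted, 3))  # → 0.804 0.886
```

The weighted score exceeds the macro score exactly as in the DARES results, because errors on the rare class drag the macro average down while barely moving the support-weighted one.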
The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj | Saad Ezzini
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.
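Pairwise sentence alignment of the kind the MCWC provides is commonly produced with length-based dynamic programming (in the spirit of Gale-Church); the corpus's actual alignment method is not described here, so the following is only an illustrative toy 1-1 aligner where skipping a sentence costs its length:

```python
def align(src, tgt):
    """Toy monotone 1-1 sentence aligner: pairing costs the character-length
    difference, skipping a sentence costs its full length."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            if i < n and j < m and dp[i][j] + abs(len(src[i]) - len(tgt[j])) < dp[i + 1][j + 1]:
                dp[i + 1][j + 1] = dp[i][j] + abs(len(src[i]) - len(tgt[j]))
                back[i + 1][j + 1] = "pair"
            if i < n and dp[i][j] + len(src[i]) < dp[i + 1][j]:
                dp[i + 1][j] = dp[i][j] + len(src[i])
                back[i + 1][j] = "skip_src"
            if j < m and dp[i][j] + len(tgt[j]) < dp[i][j + 1]:
                dp[i][j + 1] = dp[i][j] + len(tgt[j])
                back[i][j + 1] = "skip_tgt"
    # Backtrack from (n, m) to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if back[i][j] == "pair":
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif back[i][j] == "skip_src":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

src = ["Hello world.", "How are you?"]
tgt = ["Bonjour le monde.", "Comment allez-vous ?"]
for s, t in align(src, tgt):
    print(s, "<->", t)
```

Production aligners add 1-2/2-1 merges and translation-probability signals, but the dynamic-programming skeleton is the same.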
2023
Unifying Emotion Analysis Datasets using Valence Arousal Dominance (VAD)
Mo El-Haj | Ryutaro Takanami
Proceedings of the 4th Conference on Language, Data and Knowledge
FinAraT5: A text to text model for financial Arabic text understanding and generation
Nadhem Zmandar | Mo El-Haj | Paul Rayson
Proceedings of the 4th Conference on Language, Data and Knowledge
Open-Source Thesaurus Development for Under-Resourced Languages: a Welsh Case Study
Nouran Khallaf | Elin Arfon | Mo El-Haj | Jonathan Morris | Dawn Knight | Paul Rayson | Tymaa Hasanain Hammouda | Mustafa Jarrar
Proceedings of the 4th Conference on Language, Data and Knowledge
Exploring Abstractive Text Summarisation for Podcasts: A Comparative Study of BART and T5 Models
Parth Saxena | Mo El-Haj
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Podcasts have become increasingly popular in recent years, resulting in a massive amount of audio content being produced every day. Efficient summarisation of podcast episodes can enable better content management and discovery for users. In this paper, we explore the use of abstractive text summarisation methods to generate high-quality summaries of podcast episodes. We fine-tune the pre-trained models BART and T5 on a dataset of 100K Spotify podcasts. We evaluate our models using automated metrics and human evaluation, and find that the BART model fine-tuned on the podcast dataset achieved higher ROUGE-1 and ROUGE-L scores than the other models, while the T5 model performed better in terms of semantic meaning. The human evaluation indicates that both models produced high-quality summaries that were well received by participants. Our study demonstrates the effectiveness of abstractive summarisation methods for podcast episodes and offers insights for improving the summarisation of audio content.