Shatha Altammami


2025

Leveraging AI to Bridge Classical Arabic and Modern Standard Arabic for Text Simplification
Shatha Altammami
Proceedings of the New Horizons in Computational Linguistics for Religious Texts

This paper introduces the Hadith Simplification Dataset, a novel resource comprising 250 pairs of Classical Arabic (CA) Hadith texts and their simplified Modern Standard Arabic (MSA) equivalents. Addressing the lack of resources for simplifying culturally and religiously significant texts, this dataset bridges linguistic and accessibility gaps while preserving theological integrity. The simplifications were generated using a large language model and rigorously verified by an Islamic Studies expert to ensure precision and cultural sensitivity. By tackling the unique lexical, syntactic, and cultural challenges of CA-to-MSA transformation, this resource advances Arabic text simplification research. Beyond religious texts, the methodology developed is adaptable to other domains, such as poetry and historical literature. This work underscores the importance of ethical AI applications in preserving the integrity of religious texts while enhancing their accessibility to modern audiences.

2022

Challenging the Transformer-based models with a Classical Arabic dataset: Quran and Hadith
Shatha Altammami | Eric Atwell
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Transformer-based models have shown near-perfect results on several downstream tasks. However, their performance on classical Arabic texts is largely unexplored. To fill this gap, we evaluate monolingual, bilingual, and multilingual state-of-the-art models to detect relatedness between the Quran (the Muslim holy book) and the Hadith (the Prophet Muhammad’s teachings), which are complex classical Arabic texts with underlying meanings that require deep human understanding. To do this, we carefully built a dataset of Quran-verse and Hadith-teaching pairs by consulting sources of reputable religious experts. This study presents the methodology of creating the dataset, which we make available on our repository, and discusses the models’ performance, which points to a pressing need to explore avenues for improving these models so that they capture the semantics of such complex, low-resource texts.

2020

Constructing a Bilingual Hadith Corpus Using a Segmentation Tool
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the Twelfth Language Resources and Evaluation Conference

This article describes the process of gathering and constructing a bilingual parallel corpus of Islamic Hadith, which is the set of narratives reporting different aspects of the Prophet Muhammad’s life. The corpus data is gathered from the six canonical Hadith collections using a custom segmentation tool that automatically segments and annotates the two Hadith components with 92% accuracy. This Hadith segmenter minimises the costs of language resource creation and produces consistent results, independent of the prior knowledge and experience that usually influence human annotators. The corpus includes more than 10M tokens and will be freely available via the LREC repository.

2019

Text Segmentation Using N-grams to Annotate Hadith Corpus
Shatha Altammami | Eric Atwell | Ammar Alsalka
Proceedings of the 3rd Workshop on Arabic Corpus Linguistics