Classical Arabic, like other historical languages, lacks adequate training datasets and accurate “off-the-shelf” models that can be employed directly in processing pipelines. In this paper, we present our work in progress on developing and training deep learning models tailored to diverse tasks relevant to classical Arabic texts. Specifically, we focus on named entity recognition, person relationship classification, toponym sub-classification, onomastic section boundary detection, onomastic entity classification, and date recognition and classification. Our work aims to address the challenges associated with these tasks and to provide effective solutions for analyzing classical Arabic texts. Although this work is still in progress, the preliminary results reported in the paper indicate satisfactory to excellent performance of the fine-tuned models, which meet the goals for which they were trained.
Shamela: A Large-Scale Historical Arabic Corpus
Yonatan Belinkov | Alexander Magidow | Maxim Romanov | Avi Shmidman | Moshe Koppel
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic, comprising about 1 billion words from diverse periods. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies that show its application to the digital humanities.