Constructing a Bilingual Hadith Corpus Using a Segmentation Tool

Shatha Altammami, Eric Atwell, Ammar Alsalka


Abstract
This article describes the process of gathering and constructing a bilingual parallel corpus of Islamic Hadith, the set of narratives reporting different aspects of the Prophet Muhammad’s life. The corpus data is gathered from the six canonical Hadith collections using a custom segmentation tool that automatically segments and annotates the two Hadith components, the Isnad (chain of narrators) and the Matn (narrative text), with 92% accuracy. This Hadith segmenter minimises the cost of language resource creation and produces consistent results, independent of the prior knowledge and experience that usually influence human annotators. The corpus includes more than 10M tokens and will be freely available via the LREC repository.
Anthology ID:
2020.lrec-1.415
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3390–3398
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.415
Cite (ACL):
Shatha Altammami, Eric Atwell, and Ammar Alsalka. 2020. Constructing a Bilingual Hadith Corpus Using a Segmentation Tool. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3390–3398, Marseille, France. European Language Resources Association.
Cite (Informal):
Constructing a Bilingual Hadith Corpus Using a Segmentation Tool (Altammami et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.415.pdf