A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts

Iglika Nikolova-Stoupak; Maxime Amblard; Sophie Robert-Hayek; Frédérique Rey

doi:10.18653/v1/2025.findings-acl.1098

Abstract

This project belongs to the field of stemmatology, the study and/or reconstruction of textual transmission based on the relationships between the available witnesses of a given text. Specifically, it considers word-level variants (differences) in manuscripts written in Biblical Hebrew. A strong classifier (F1 of 0.80) is trained to predict the category of difference between word pairs (‘plus/minus’, ‘inversion’, ‘morphological’, ‘lexical’ or ‘unclassifiable’) as they occur in collated (aligned) pairs of witnesses. The classifier is non-neural and uses the two words themselves together with part-of-speech (POS) tags, hand-crafted rules per category and synthetically derived data. Other models tested include neural ones based on DictaBERT, the state-of-the-art model for Modern Hebrew. Other features tested for relevance are different types of morphological information about the word pairs and the Levenshtein distance between the words. A selection of the strongest classifiers is made available, along with the synthetic data and the steps taken to derive it. In addition, the correlation between two sets of morphological labels is investigated: labels professionally established in the Qumran-Digital online library and labels automatically derived with the sub-model DictaBERT-morph.
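To make the feature set mentioned in the abstract more concrete, the following is a minimal sketch of how a Levenshtein distance and a few simple pair-level features might be computed for a collated word pair. It is not the authors' implementation: the function names `levenshtein` and `pair_features`, the feature keys, and the POS tag strings are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pair_features(word_a: str, word_b: str, pos_a: str, pos_b: str) -> dict:
    """Hypothetical feature dictionary for one aligned word pair.

    An empty string stands for a gap in the alignment (assumption).
    """
    return {
        "word_a": word_a,
        "word_b": word_b,
        "pos_a": pos_a,
        "pos_b": pos_b,
        "same_pos": pos_a == pos_b,
        "one_side_empty": not word_a or not word_b,
        "levenshtein": levenshtein(word_a, word_b),
    }


if __name__ == "__main__":
    # Singular vs. plural form of the same noun: a small edit distance
    # and identical POS tags.
    print(pair_features("דבר", "דברים", "NOUN", "NOUN"))
```

A dictionary of this kind could be passed to any non-neural classifier after vectorization; which classifier and which encoding the paper actually uses is not stated in the abstract.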

Anthology ID: 2025.findings-acl.1098
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21313–21329
URL: https://aclanthology.org/2025.findings-acl.1098/
DOI: 10.18653/v1/2025.findings-acl.1098
Cite (ACL): Iglika Nikolova-Stoupak, Maxime Amblard, Sophie Robert-Hayek, and Frédérique Rey. 2025. A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21313–21329, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts (Nikolova-Stoupak et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.1098.pdf
