MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts

Avi Shmidman, Ometz Shmidman, Hillel Gershuni, Moshe Koppel


Abstract
Hebrew manuscripts preserve thousands of textual transmissions of post-Biblical Hebrew texts from the first millennium. In many cases, the text in the manuscripts is not fully decipherable, whether due to deterioration, perforation, burns, or other damage. Existing BERT models for Hebrew struggle to fill these gaps, due to the many orthographic deviations found in Hebrew manuscripts. We have pretrained a new dedicated BERT model, dubbed MsBERT (short for: Manuscript BERT), designed from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models on the task of predicting missing words in fragmentary Hebrew manuscript transcriptions across multiple genres, as well as on the task of differentiating between quoted passages and exegetical elaborations. We provide MsBERT for free download and unrestricted use, and we also provide an interactive, user-friendly website that allows manuscript scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.
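Since the abstract notes that MsBERT is released for free download, the sketch below shows one way a reader might query a masked-language model of this kind for missing-word candidates, using the Hugging Face transformers fill-mask pipeline. The Hub identifier dicta-il/MsBERT and the Hebrew example line are assumptions for illustration only and are not taken from the paper; substitute the identifier given on the authors' download page.

```python
# Minimal sketch: querying a masked-language model for lacuna candidates.
# The model identifier "dicta-il/MsBERT" is an assumption for illustration.
from transformers import pipeline

fill = pipeline("fill-mask", model="dicta-il/MsBERT")

# A fragmentary line with one unreadable word marked by "____";
# swap in the tokenizer's own mask token before querying the model.
fragment = "ברוך אתה ה' אלהינו מלך ____ אשר קדשנו במצותיו"
masked = fragment.replace("____", fill.tokenizer.mask_token)

# Print the top reconstruction candidates with their scores.
for candidate in fill(masked, top_k=5):
    print(f"{candidate['token_str']}\t{candidate['score']:.3f}")
```

Real lacunae often span several words; for those cases one would mask multiple tokens or use the authors' interactive website rather than a single fill-mask query.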
Anthology ID:
2024.ml4al-1.2
Volume:
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Month:
August
Year:
2024
Address:
Hybrid in Bangkok, Thailand and online
Editors:
John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson
Venues:
ML4AL | WS
Publisher:
Association for Computational Linguistics
Pages:
13–18
URL:
https://aclanthology.org/2024.ml4al-1.2
Cite (ACL):
Avi Shmidman, Ometz Shmidman, Hillel Gershuni, and Moshe Koppel. 2024. MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts. In Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 13–18, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics.
Cite (Informal):
MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts (Shmidman et al., ML4AL-WS 2024)
PDF:
https://aclanthology.org/2024.ml4al-1.2.pdf