A Transformer-based parser for Syriac morphology

Martijn Naaijer, Constantijn Sikkel, Mathias Coeckelbergs, Jisk Attema, Willem Th. Van Peursen


Abstract
In this project we train a Transformer-based model from scratch, with the goal of parsing the morphology of Ancient Syriac texts as accurately as possible. Syriac is still a low-resource language, and only a relatively small training set was available. Therefore, the training set was expanded by adding Biblical Hebrew data to it. Five different experiments were done: the model was trained on Syriac data only, it was trained on mixed Syriac and (un)vocalized Hebrew data, and it was pretrained on (un)vocalized Hebrew data and then finetuned on Syriac data. The models trained on both Hebrew and Syriac data consistently outperform the models trained on Syriac data only. This shows that the differences between Syriac and Hebrew are small enough that it is worth adding Hebrew data when training a model to parse Syriac morphology. Training models on multiple languages is an important trend in NLP, and we show that this also works well for relatively small datasets of Syriac and Hebrew.
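The five experimental configurations described in the abstract can be sketched as different ways of assembling the pretraining and training corpora. The following is a minimal illustrative sketch, not the authors' actual code; the `build_experiments` helper and all dataset names are hypothetical.

```python
# Hypothetical sketch of the five experimental setups from the abstract:
# one Syriac-only run, two mixed-corpus runs, and two pretrain-then-finetune
# runs. Each experiment is represented as (pretrain_corpus, train_corpus);
# an empty pretrain_corpus means training from scratch on train_corpus only.

def build_experiments(syriac, hebrew_voc, hebrew_unvoc):
    """Return each experiment as a (pretrain_corpus, train_corpus) pair."""
    return {
        # Baseline: Syriac data only, no Hebrew.
        "syriac_only":          ([], syriac),
        # Mixed training: Syriac plus vocalized / unvocalized Hebrew.
        "mixed_vocalized":      ([], syriac + hebrew_voc),
        "mixed_unvocalized":    ([], syriac + hebrew_unvoc),
        # Pretrain on Hebrew, then finetune on Syriac.
        "pretrain_vocalized":   (hebrew_voc, syriac),
        "pretrain_unvocalized": (hebrew_unvoc, syriac),
    }

# Toy corpora (placeholder strings standing in for annotated word forms).
experiments = build_experiments(["syr1", "syr2"], ["heb_v1"], ["heb_u1"])
```

Each of the five entries would then drive one training run of the morphological parser, with the pretrain corpus (if any) used first and the train corpus used afterwards.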
Anthology ID:
2023.alp-1.3
Volume:
Proceedings of the Ancient Language Processing Workshop
Month:
September
Year:
2023
Address:
Varna, Bulgaria
Editors:
Adam Anderson, Shai Gordin, Bin Li, Yudong Liu, Marco C. Passarotti
Venues:
ALP | WS
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
23–29
URL:
https://aclanthology.org/2023.alp-1.3
Cite (ACL):
Martijn Naaijer, Constantijn Sikkel, Mathias Coeckelbergs, Jisk Attema, and Willem Th. Van Peursen. 2023. A Transformer-based parser for Syriac morphology. In Proceedings of the Ancient Language Processing Workshop, pages 23–29, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
A Transformer-based parser for Syriac morphology (Naaijer et al., ALP-WS 2023)
PDF:
https://aclanthology.org/2023.alp-1.3.pdf