Mathias Coeckelbergs
2023
A Transformer-based parser for Syriac morphology
Martijn Naaijer | Constantijn Sikkel | Mathias Coeckelbergs | Jisk Attema | Willem Th. Van Peursen
Proceedings of the Ancient Language Processing Workshop
In this project we train a Transformer-based model from scratch, with the goal of parsing the morphology of Ancient Syriac texts as accurately as possible. Because Syriac is still a low-resource language, only a relatively small training set was available. The training set was therefore expanded by adding Biblical Hebrew data. Five experiments were done: the model was trained on Syriac data only; it was trained on mixed Syriac and Hebrew data, with the Hebrew either vocalized or unvocalized; and it was pretrained on vocalized or unvocalized Hebrew data and then finetuned on Syriac data. The models trained on Hebrew and Syriac data consistently outperform the models trained on Syriac data only. This shows that the differences between Syriac and Hebrew are small enough that it is worth adding Hebrew data when training a model to parse Syriac morphology. Training models on multiple languages is an important trend in NLP, and we show that this works well for relatively small datasets of Syriac and Hebrew.
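The five experiments described in the abstract can be read as one Syriac-only baseline plus two training strategies, each crossed with two Hebrew variants. The following is a minimal sketch of that grid, not the authors' code; the `Experiment` class, the strategy labels, and the experiment names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Experiment:
    """One training setup (hypothetical structure, not from the paper)."""
    name: str
    strategy: str                     # "syriac_only", "mixed", or "pretrain_finetune"
    hebrew_vocalized: Optional[bool]  # None when no Hebrew data is used


def build_experiments() -> List[Experiment]:
    """Enumerate the five setups: 1 baseline + 2 strategies x 2 Hebrew variants."""
    experiments = [Experiment("syriac_only", "syriac_only", None)]
    for strategy in ("mixed", "pretrain_finetune"):
        for vocalized in (True, False):
            label = "vocalized" if vocalized else "unvocalized"
            experiments.append(
                Experiment(f"{strategy}_{label}_hebrew", strategy, vocalized)
            )
    return experiments


if __name__ == "__main__":
    for experiment in build_experiments():
        print(experiment.name)
```

Listing the setups this way makes explicit that every Hebrew-augmented run is compared against the same Syriac-only baseline.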
2022
From Pattern to Interpretation. Using Colibri Core to Detect Translation Patterns in the Peshitta.
Mathias Coeckelbergs
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article presents the first results of the CLARIAH-funded project ‘Patterns in Translation: Using Colibri Core for the Syriac Bible’ (PaTraCoSy). This project seeks to use Colibri Core to detect translation patterns in the Peshitta, the Syriac translation of the Hebrew Bible. We first describe how we constructed word and phrase alignment between these two texts. This step is necessary to successfully implement the functionalities of Colibri Core. After this, we further describe our first investigations with the software. We describe how we use the built-in pattern modeller to detect n-gram and skipgram patterns in both Hebrew and Syriac texts. Colibri Core does not allow the creation of a bilingual model, which is why we compare the separate models. After a presentation of a few general insights on the overall translation behaviour of the Peshitta, we delve deeper into the concrete patterns we can detect by the n-gram/skipgram analysis. We provide multiple examples from the book of Genesis, a book which has been treated broadly in scholarly research into the Syriac translation, but which also appears to have interesting features based on our Colibri Core research.
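To illustrate what n-gram and skipgram patterns are, the sketch below counts both over tokenized sentences in plain Python. It does not use Colibri Core or its API; the function names, the `{*}` gap marker, the restriction to a single one-token gap, and the toy tokens are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

GAP = "{*}"  # placeholder for a skipped token (assumed notation)
Pattern = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> List[Pattern]:
    """All contiguous windows of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens: Sequence[str], n: int) -> List[Pattern]:
    """n-token windows with exactly one interior token replaced by a gap."""
    result = []
    for gram in ngrams(tokens, n):
        for gap in range(1, n - 1):  # first and last token are always kept
            result.append(gram[:gap] + (GAP,) + gram[gap + 1:])
    return result


def count_patterns(
    sentences: Sequence[Sequence[str]], n: int, min_count: int = 2
) -> Dict[Pattern, int]:
    """Count n-grams and skipgrams, keeping those seen at least min_count times."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
        counts.update(skipgrams(tokens, n))
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Toy tokens standing in for words of a real text.
    toy = [["a", "b", "c"], ["a", "x", "c"]]
    print(count_patterns(toy, n=3))
```

In the toy input neither trigram recurs, but the skipgram `a {*} c` occurs twice, which is the kind of recurring frame a skipgram analysis can surface. Because the counting is monolingual, a Hebrew and a Syriac pattern set would be built separately and then compared, in line with the separate models described in the abstract.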